finding duplicate records within time frames.

Accepted Solution Solved
Reply
Valued Guide
Posts: 860
Accepted Solution

finding duplicate records within time frames.

I'm back with another one today, simple problem that a large dataset is making difficult.  I have est. 100 mm records that I'll post a section of below.  The first column is customer accounts, the second represents a date.  I'd like to find if there are any duplicate customer account numbers within a specific date.  I tried a proc freq but the dataset is too large.  Is retain  first.customer a solution?  I've been trying several things but coming up with nothing good:

19660719240499
19656740940499
19638469940499
19665415240499
19652444440499
19637047440499
19658065940499

Accepted Solutions
Solution
‎09-27-2012 03:47 PM
Super User
Posts: 5,503

Re: finding duplicate records within time frames.

Posted in reply to Steelers_In_DC

If your PROC SQL step is working, this would be a better way to follow up:

data cus_hsd_final;

   set cus_hsd;

   by ver_start_day_key cust_acct_key;

   count + 1;

   if last.cust_acct_key;

   if count > 1 then output;

   count=0;

run;

No PROC FREQ needed, as the DATA step can count.

Good luck.

View solution in original post


All Replies
Valued Guide
Posts: 860

Re: finding duplicate records within time frames.

Posted in reply to Steelers_In_DC

in addition, the dataset covers 20 months.

Valued Guide
Posts: 860

Re: finding duplicate records within time frames.

Posted in reply to Steelers_In_DC

One last thing, here was my initial idea:

proc sql;

create table cus_hsd as

select ver_start_day_key, cust_acct_key

from prd_sas.sas_cus_hsd_source_master

order by ver_start_day_key, cust_acct_key;

proc freq data=cus_hsd;

table ver_start_day_key * cust_acct_key / nocol nopercent norow noprint out=cus_hsd_freq;

data cus_hsd_final;

set cus_hsd_freq;

where count > 1;

run;

Solution
‎09-27-2012 03:47 PM
Super User
Posts: 5,503

Re: finding duplicate records within time frames.

Posted in reply to Steelers_In_DC

If your PROC SQL step is working, this would be a better way to follow up:

data cus_hsd_final;

   set cus_hsd;

   by ver_start_day_key cust_acct_key;

   count + 1;

   if last.cust_acct_key;

   if count > 1 then output;

   count=0;

run;

No PROC FREQ needed, as the DATA step can count.

Good luck.

Valued Guide
Posts: 860

Re: finding duplicate records within time frames.

Posted in reply to Astounding

That is incredibly fast as well, thank you very much. 

Valued Guide
Posts: 860

Re: finding duplicate records within time frames.

Posted in reply to Astounding

If you wouldn't mind, can you take a moment to explain?  I understand what the count > 1 is doing, but not why the count=0 is necessary.

Thanks,

Super User
Posts: 5,503

Re: finding duplicate records within time frames.

Posted in reply to Steelers_In_DC

count=0 is setting COUNT back to 0, because the DATA step is about to begin processing the first record for the next customer.  Each block of records (day + customer = a block) should begin with count=0.  It's positioned at the bottom of the DATA step just to gain a small bit of speed, since you are dealing with 100M records.  This DATA step would have worked equally well, but would have needed to check for a block beginning:

data cust_hsd_final;

   set cust_hsd;

   by ver_start_day_key cust_acct_key;

   if first.cust_acct_key then count=1;

   else count + 1;

   if last.cust_acct_key and count > 1;

run;

However, it would have been a little bit slower because of the need to check for first.cust_acct_key.

Valued Guide
Posts: 860

Re: finding duplicate records within time frames.

Posted in reply to Astounding

Awesome, thanks so much.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 317 views
  • 3 likes
  • 2 in conversation