Solved: finding duplicate records within time frames.

Steelers_In_DC · Posted 09-27-2012 03:17 PM

I'm back with another one today, simple problem that a large dataset is making difficult. I have est. 100 mm records that I'll post a section of below. The first column is customer accounts, the second represents a date. I'd like to find if there are any duplicate customer account numbers within a specific date. I tried a proc freq but the dataset is too large. Is retain first.customer a solution? I've been trying several things but coming up with nothing good:

196607192	40499
196567409	40499
196384699	40499
196654152	40499
196524444	40499
196370474	40499
196580659	40499

Astounding · Posted 09-27-2012 03:47 PM

If your PROC SQL step is working, this would be a better way to follow up:

data cus_hsd_final;

set cus_hsd;

by ver_start_day_key cust_acct_key;

count + 1;

if last.cust_acct_key;

if count > 1 then output;

count=0;

run;

No PROC FREQ needed, as the DATA step can count.

Good luck.

View solution in original post

Steelers_In_DC · Posted 09-27-2012 03:17 PM

in addition, the dataset covers 20 months.

Steelers_In_DC · Posted 09-27-2012 03:28 PM

One last thing, here was my initial idea:

proc sql;

create table cus_hsd as

select ver_start_day_key, cust_acct_key

from prd_sas.sas_cus_hsd_source_master

order by ver_start_day_key, cust_acct_key;

proc freq data=cus_hsd;

table ver_start_day_key * cust_acct_key / nocol nopercent norow noprint out=cus_hsd_freq;

data cus_hsd_final;

set cus_hsd_freq;

where count > 1;

run;

Astounding · Posted 09-27-2012 03:47 PM

If your PROC SQL step is working, this would be a better way to follow up:

data cus_hsd_final;

set cus_hsd;

by ver_start_day_key cust_acct_key;

count + 1;

if last.cust_acct_key;

if count > 1 then output;

count=0;

run;

No PROC FREQ needed, as the DATA step can count.

Good luck.

Steelers_In_DC · Posted 09-27-2012 03:52 PM

That is incredibly fast as well, thank you very much.

Steelers_In_DC · Posted 09-27-2012 03:57 PM

If you wouldn't mind, can you take a moment to explain? I understand what the count > 1 is doing, but not why the count=0 is necessary.

Thanks,

Astounding · Posted 09-27-2012 04:07 PM

count=0 is setting COUNT back to 0, because the DATA step is about to begin processing the first record for the next customer. Each block of records (day + customer = a block) should begin with count=0. It's positioned at the bottom of the DATA step just to gain a small bit of speed, since you are dealing with 100M records. This DATA step would have worked equally well, but would have needed to check for a block beginning:

data cust_hsd_final;

set cust_hsd;

by ver_start_day_key cust_acct_key;

if first.cust_acct_key then count=1;

else count + 1;

if last.cust_acct_key and count > 1;

run;

However, it would have been a little bit slower because of the need to check for first.cust_acct_key.

Steelers_In_DC · Posted 09-27-2012 04:11 PM

Awesome, thanks so much.

finding duplicate records within time frames.

Re: finding duplicate records within time frames.

Re: finding duplicate records within time frames.

Re: finding duplicate records within time frames.

Re: finding duplicate records within time frames.

Re: finding duplicate records within time frames.

Re: finding duplicate records within time frames.

Re: finding duplicate records within time frames.

Re: finding duplicate records within time frames.

Classroom Training Available!