I'm back with another one today, simple problem that a large dataset is making difficult. I have est. 100 mm records that I'll post a section of below. The first column is customer accounts, the second represents a date. I'd like to find if there are any duplicate customer account numbers within a specific date. I tried a proc freq but the dataset is too large. Is retain first.customer a solution? I've been trying several things but coming up with nothing good:
196607192 | 40499 |
196567409 | 40499 |
196384699 | 40499 |
196654152 | 40499 |
196524444 | 40499 |
196370474 | 40499 |
196580659 | 40499 |
If your PROC SQL step is working, this would be a better way to follow up:
data cus_hsd_final;
set cus_hsd;
by ver_start_day_key cust_acct_key;
count + 1;
if last.cust_acct_key;
if count > 1 then output;
count=0;
run;
No PROC FREQ needed, as the DATA step can count.
Good luck.
in addition, the dataset covers 20 months.
One last thing, here was my initial idea:
proc sql;
create table cus_hsd as
select ver_start_day_key, cust_acct_key
from prd_sas.sas_cus_hsd_source_master
order by ver_start_day_key, cust_acct_key;
proc freq data=cus_hsd;
table ver_start_day_key * cust_acct_key / nocol nopercent norow noprint out=cus_hsd_freq;
data cus_hsd_final;
set cus_hsd_freq;
where count > 1;
run;
If your PROC SQL step is working, this would be a better way to follow up:
data cus_hsd_final;
set cus_hsd;
by ver_start_day_key cust_acct_key;
count + 1;
if last.cust_acct_key;
if count > 1 then output;
count=0;
run;
No PROC FREQ needed, as the DATA step can count.
Good luck.
That is incredibly fast as well, thank you very much.
If you wouldn't mind, can you take a moment to explain? I understand what the count > 1 is doing, but not why the count=0 is necessary.
Thanks,
count=0 is setting COUNT back to 0, because the DATA step is about to begin processing the first record for the next customer. Each block of records (day + customer = a block) should begin with count=0. It's positioned at the bottom of the DATA step just to gain a small bit of speed, since you are dealing with 100M records. This DATA step would have worked equally well, but would have needed to check for a block beginning:
data cust_hsd_final;
set cust_hsd;
by ver_start_day_key cust_acct_key;
if first.cust_acct_key then count=1;
else count + 1;
if last.cust_acct_key and count > 1;
run;
However, it would have been a little bit slower because of the need to check for first.cust_acct_key.
Awesome, thanks so much.
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.