I'm back with another one today, simple problem that a large dataset is making difficult. I have est. 100 mm records that I'll post a section of below. The first column is customer accounts, the second represents a date. I'd like to find if there are any duplicate customer account numbers within a specific date. I tried a proc freq but the dataset is too large. Is retain first.customer a solution? I've been trying several things but coming up with nothing good:
196607192 | 40499 |
196567409 | 40499 |
196384699 | 40499 |
196654152 | 40499 |
196524444 | 40499 |
196370474 | 40499 |
196580659 | 40499 |
If your PROC SQL step is working, this would be a better way to follow up:
data cus_hsd_final;
set cus_hsd;
by ver_start_day_key cust_acct_key;
count + 1;
if last.cust_acct_key;
if count > 1 then output;
count=0;
run;
No PROC FREQ needed, as the DATA step can count.
Good luck.
in addition, the dataset covers 20 months.
One last thing, here was my initial idea:
proc sql;
create table cus_hsd as
select ver_start_day_key, cust_acct_key
from prd_sas.sas_cus_hsd_source_master
order by ver_start_day_key, cust_acct_key;
proc freq data=cus_hsd;
table ver_start_day_key * cust_acct_key / nocol nopercent norow noprint out=cus_hsd_freq;
data cus_hsd_final;
set cus_hsd_freq;
where count > 1;
run;
If your PROC SQL step is working, this would be a better way to follow up:
data cus_hsd_final;
set cus_hsd;
by ver_start_day_key cust_acct_key;
count + 1;
if last.cust_acct_key;
if count > 1 then output;
count=0;
run;
No PROC FREQ needed, as the DATA step can count.
Good luck.
That is incredibly fast as well, thank you very much.
If you wouldn't mind, can you take a moment to explain? I understand what the count > 1 is doing, but not why the count=0 is necessary.
Thanks,
count=0 is setting COUNT back to 0, because the DATA step is about to begin processing the first record for the next customer. Each block of records (day + customer = a block) should begin with count=0. It's positioned at the bottom of the DATA step just to gain a small bit of speed, since you are dealing with 100M records. This DATA step would have worked equally well, but would have needed to check for a block beginning:
data cust_hsd_final;
set cust_hsd;
by ver_start_day_key cust_acct_key;
if first.cust_acct_key then count=1;
else count + 1;
if last.cust_acct_key and count > 1;
run;
However, it would have been a little bit slower because of the need to check for first.cust_acct_key.
Awesome, thanks so much.
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.