BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Steelers_In_DC
Barite | Level 11

I'm back with another one today, simple problem that a large dataset is making difficult.  I have est. 100 mm records that I'll post a section of below.  The first column is customer accounts, the second represents a date.  I'd like to find if there are any duplicate customer account numbers within a specific date.  I tried a proc freq but the dataset is too large.  Is retain  first.customer a solution?  I've been trying several things but coming up with nothing good:

19660719240499
19656740940499
19638469940499
19665415240499
19652444440499
19637047440499
19658065940499
1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

If your PROC SQL step is working, this would be a better way to follow up:

data cus_hsd_final;

   set cus_hsd;

   by ver_start_day_key cust_acct_key;

   count + 1;

   if last.cust_acct_key;

   if count > 1 then output;

   count=0;

run;

No PROC FREQ needed, as the DATA step can count.

Good luck.

View solution in original post

7 REPLIES 7
Steelers_In_DC
Barite | Level 11

in addition, the dataset covers 20 months.

Steelers_In_DC
Barite | Level 11

One last thing, here was my initial idea:

proc sql;

create table cus_hsd as

select ver_start_day_key, cust_acct_key

from prd_sas.sas_cus_hsd_source_master

order by ver_start_day_key, cust_acct_key;

proc freq data=cus_hsd;

table ver_start_day_key * cust_acct_key / nocol nopercent norow noprint out=cus_hsd_freq;

data cus_hsd_final;

set cus_hsd_freq;

where count > 1;

run;

Astounding
PROC Star

If your PROC SQL step is working, this would be a better way to follow up:

data cus_hsd_final;

   set cus_hsd;

   by ver_start_day_key cust_acct_key;

   count + 1;

   if last.cust_acct_key;

   if count > 1 then output;

   count=0;

run;

No PROC FREQ needed, as the DATA step can count.

Good luck.

Steelers_In_DC
Barite | Level 11

That is incredibly fast as well, thank you very much. 

Steelers_In_DC
Barite | Level 11

If you wouldn't mind, can you take a moment to explain?  I understand what the count > 1 is doing, but not why the count=0 is necessary.

Thanks,

Astounding
PROC Star

count=0 is setting COUNT back to 0, because the DATA step is about to begin processing the first record for the next customer.  Each block of records (day + customer = a block) should begin with count=0.  It's positioned at the bottom of the DATA step just to gain a small bit of speed, since you are dealing with 100M records.  This DATA step would have worked equally well, but would have needed to check for a block beginning:

data cust_hsd_final;

   set cust_hsd;

   by ver_start_day_key cust_acct_key;

   if first.cust_acct_key then count=1;

   else count + 1;

   if last.cust_acct_key and count > 1;

run;

However, it would have been a little bit slower because of the need to check for first.cust_acct_key.

Steelers_In_DC
Barite | Level 11

Awesome, thanks so much.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 719 views
  • 3 likes
  • 2 in conversation