BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MarkWik
Quartz | Level 8

Hey Folks,

I have got two datasets, Dataset 1 and Dataset 2  with the following variables.

Account_id  Markets  Sales      Date

110032A      Medium   200     20/11/2013

110032A       Small     150     21/11/2013

110032A      Large       250     22/11/2013

110021B       Large      100     20/11/20013

110022C      Medium    230     25/11/2013

----and so on 25 million large dataset

The Dataset 2 also has the same variables as Dataset 1 and the requirement is that the records and values must match Dataset 1 as its a merely a rewrite from old process. However, I find there are more records in  my Dataset 2 and I would like to make a comparison with dataset 1 to see and identity those records that contribute to excess in dataset2.  There could some new account_id, or there could be new markets for the some existing account ids. The idea is to exclude the new records and create a separate report.

I tried proc compare, but of no use. I would appreciate a datastep pr proc sql logic. Thanks.

Mark

NOTE: The account_id s are not sorted

1 ACCEPTED SOLUTION

Accepted Solutions
Vince28_Statcan
Quartz | Level 8

I think proc compare requires each set to be sorted but don't take my word on this. I don't know what hardware you are working with but likely, 32 length or so variables could fit in the standard instal of -MEMSIZE 2G to use hash table to save proc sql from achieving a full sort on each set before joining.

So I suggest you try the following:

data old new;

     if _n_=1 then do;

          declare hash myhash(dataset: 'dataset1');

          myhash.definekey('account_id', 'markets', 'sales', 'date');

          /*     myhash.definedata(); shouldn't need to definedata unless dataset1 isn't fully included in dataset2 and you wish to create a 3rd dataset for those obscure records */

          myhash.definedone();

     end;

     set dataset2;

     if myhash.find()=0 then output old;

     else output new;

run;

Provided you have enough room in memory to load such a big hash table (so technically if you really only have 2 strings and 2 numeric variables, 32 bytes * 25M records < 1.9G or so that is available to hashing in standard install conditions), this should save you a lot of time from sql joins.

logic is if record from table2 is found in table1 you output old (all methods return code 0 if they are succesful)

otherwise output new

Vincent

View solution in original post

2 REPLIES 2
Vince28_Statcan
Quartz | Level 8

I think proc compare requires each set to be sorted but don't take my word on this. I don't know what hardware you are working with but likely, 32 length or so variables could fit in the standard instal of -MEMSIZE 2G to use hash table to save proc sql from achieving a full sort on each set before joining.

So I suggest you try the following:

data old new;

     if _n_=1 then do;

          declare hash myhash(dataset: 'dataset1');

          myhash.definekey('account_id', 'markets', 'sales', 'date');

          /*     myhash.definedata(); shouldn't need to definedata unless dataset1 isn't fully included in dataset2 and you wish to create a 3rd dataset for those obscure records */

          myhash.definedone();

     end;

     set dataset2;

     if myhash.find()=0 then output old;

     else output new;

run;

Provided you have enough room in memory to load such a big hash table (so technically if you really only have 2 strings and 2 numeric variables, 32 bytes * 25M records < 1.9G or so that is available to hashing in standard install conditions), this should save you a lot of time from sql joins.

logic is if record from table2 is found in table1 you output old (all methods return code 0 if they are succesful)

otherwise output new

Vincent

MarkWik
Quartz | Level 8

Hi Vincent,

Thank you so much indeed for the very quick response in providing me the much needed help. Your solution worked amazingly well and the performance was simply great. And sorry for the delayed reply reply from me  as I only got the chance to run your solution couple of hours ago.

Much Appreciate it.

Mark

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 416 views
  • 1 like
  • 2 in conversation