Request for logical dataset comparison help?

Accepted Solution Solved
Reply
Contributor
Posts: 72
Accepted Solution

Request for logical dataset comparison help?

Hey Folks,

I have got two datasets, Dataset 1 and Dataset 2  with the following variables.

Account_id  Markets  Sales      Date

110032A      Medium   200     20/11/2013

110032A       Small     150     21/11/2013

110032A      Large       250     22/11/2013

110021B       Large      100     20/11/20013

110022C      Medium    230     25/11/2013

----and so on 25 million large dataset

The Dataset 2 also has the same variables as Dataset 1 and the requirement is that the records and values must match Dataset 1 as its a merely a rewrite from old process. However, I find there are more records in  my Dataset 2 and I would like to make a comparison with dataset 1 to see and identity those records that contribute to excess in dataset2.  There could some new account_id, or there could be new markets for the some existing account ids. The idea is to exclude the new records and create a separate report.

I tried proc compare, but of no use. I would appreciate a datastep pr proc sql logic. Thanks.

Mark

NOTE: The account_id s are not sorted


Accepted Solutions
Solution
‎12-09-2013 01:00 PM
Super Contributor
Posts: 339

Re: Request for logical dataset comparison help?

I think proc compare requires each set to be sorted but don't take my word on this. I don't know what hardware you are working with but likely, 32 length or so variables could fit in the standard instal of -MEMSIZE 2G to use hash table to save proc sql from achieving a full sort on each set before joining.

So I suggest you try the following:

data old new;

     if _n_=1 then do;

          declare hash myhash(dataset: 'dataset1');

          myhash.definekey('account_id', 'markets', 'sales', 'date');

          /*     myhash.definedata(); shouldn't need to definedata unless dataset1 isn't fully included in dataset2 and you wish to create a 3rd dataset for those obscure records */

          myhash.definedone();

     end;

     set dataset2;

     if myhash.find()=0 then output old;

     else output new;

run;

Provided you have enough room in memory to load such a big hash table (so technically if you really only have 2 strings and 2 numeric variables, 32 bytes * 25M records < 1.9G or so that is available to hashing in standard install conditions), this should save you a lot of time from sql joins.

logic is if record from table2 is found in table1 you output old (all methods return code 0 if they are succesful)

otherwise output new

Vincent

View solution in original post


All Replies
Solution
‎12-09-2013 01:00 PM
Super Contributor
Posts: 339

Re: Request for logical dataset comparison help?

I think proc compare requires each set to be sorted but don't take my word on this. I don't know what hardware you are working with but likely, 32 length or so variables could fit in the standard instal of -MEMSIZE 2G to use hash table to save proc sql from achieving a full sort on each set before joining.

So I suggest you try the following:

data old new;

     if _n_=1 then do;

          declare hash myhash(dataset: 'dataset1');

          myhash.definekey('account_id', 'markets', 'sales', 'date');

          /*     myhash.definedata(); shouldn't need to definedata unless dataset1 isn't fully included in dataset2 and you wish to create a 3rd dataset for those obscure records */

          myhash.definedone();

     end;

     set dataset2;

     if myhash.find()=0 then output old;

     else output new;

run;

Provided you have enough room in memory to load such a big hash table (so technically if you really only have 2 strings and 2 numeric variables, 32 bytes * 25M records < 1.9G or so that is available to hashing in standard install conditions), this should save you a lot of time from sql joins.

logic is if record from table2 is found in table1 you output old (all methods return code 0 if they are succesful)

otherwise output new

Vincent

Contributor
Posts: 72

Re: Request for logical dataset comparison help?

Hi Vincent,

Thank you so much indeed for the very quick response in providing me the much needed help. Your solution worked amazingly well and the performance was simply great. And sorry for the delayed reply reply from me  as I only got the chance to run your solution couple of hours ago.

Much Appreciate it.

Mark

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 2 replies
  • 188 views
  • 1 like
  • 2 in conversation