Solved: Request for logical dataset comparison help?

MarkWik · Posted 12-09-2013 12:48 PM

Hey Folks,

I have got two datasets, Dataset 1 and Dataset 2 with the following variables.

Account_id Markets Sales Date

110032A Medium 200 20/11/2013

110032A Small 150 21/11/2013

110032A Large 250 22/11/2013

110021B Large 100 20/11/20013

110022C Medium 230 25/11/2013

----and so on 25 million large dataset

The Dataset 2 also has the same variables as Dataset 1 and the requirement is that the records and values must match Dataset 1 as its a merely a rewrite from old process. However, I find there are more records in my Dataset 2 and I would like to make a comparison with dataset 1 to see and identity those records that contribute to excess in dataset2. There could some new account_id, or there could be new markets for the some existing account ids. The idea is to exclude the new records and create a separate report.

I tried proc compare, but of no use. I would appreciate a datastep pr proc sql logic. Thanks.

Mark

NOTE: The account_id s are not sorted

Vince28_Statcan · Posted 12-09-2013 01:00 PM

I think proc compare requires each set to be sorted but don't take my word on this. I don't know what hardware you are working with but likely, 32 length or so variables could fit in the standard instal of -MEMSIZE 2G to use hash table to save proc sql from achieving a full sort on each set before joining.

So I suggest you try the following:

data old new;

if _n_=1 then do;

declare hash myhash(dataset: 'dataset1');

myhash.definekey('account_id', 'markets', 'sales', 'date');

/* myhash.definedata(); shouldn't need to definedata unless dataset1 isn't fully included in dataset2 and you wish to create a 3rd dataset for those obscure records */

myhash.definedone();

end;

set dataset2;

if myhash.find()=0 then output old;

else output new;

run;

Provided you have enough room in memory to load such a big hash table (so technically if you really only have 2 strings and 2 numeric variables, 32 bytes * 25M records < 1.9G or so that is available to hashing in standard install conditions), this should save you a lot of time from sql joins.

logic is if record from table2 is found in table1 you output old (all methods return code 0 if they are succesful)

otherwise output new

Vincent

View solution in original post

Vince28_Statcan · Posted 12-09-2013 01:00 PM

I think proc compare requires each set to be sorted but don't take my word on this. I don't know what hardware you are working with but likely, 32 length or so variables could fit in the standard instal of -MEMSIZE 2G to use hash table to save proc sql from achieving a full sort on each set before joining.

So I suggest you try the following:

data old new;

if _n_=1 then do;

declare hash myhash(dataset: 'dataset1');

myhash.definekey('account_id', 'markets', 'sales', 'date');

/* myhash.definedata(); shouldn't need to definedata unless dataset1 isn't fully included in dataset2 and you wish to create a 3rd dataset for those obscure records */

myhash.definedone();

end;

set dataset2;

if myhash.find()=0 then output old;

else output new;

run;

Provided you have enough room in memory to load such a big hash table (so technically if you really only have 2 strings and 2 numeric variables, 32 bytes * 25M records < 1.9G or so that is available to hashing in standard install conditions), this should save you a lot of time from sql joins.

logic is if record from table2 is found in table1 you output old (all methods return code 0 if they are succesful)

otherwise output new

Vincent

MarkWik · Posted 12-10-2013 12:05 PM

Hi Vincent,

Thank you so much indeed for the very quick response in providing me the much needed help. Your solution worked amazingly well and the performance was simply great. And sorry for the delayed reply reply from me as I only got the chance to run your solution couple of hours ago.

Much Appreciate it.

Mark

Request for logical dataset comparison help?

Re: Request for logical dataset comparison help?

Re: Request for logical dataset comparison help?

Re: Request for logical dataset comparison help?

Request for logical dataset comparison help?

Re: Request for logical dataset comparison help?

Re: Request for logical dataset comparison help?

Re: Request for logical dataset comparison help?

Click image to register for webinar

Classroom Training Available!