Hey Folks,
I have got two datasets, Dataset 1 and Dataset 2 with the following variables.
Account_id Markets Sales Date
110032A Medium 200 20/11/2013
110032A Small 150 21/11/2013
110032A Large 250 22/11/2013
110021B Large 100 20/11/20013
110022C Medium 230 25/11/2013
----and so on 25 million large dataset
The Dataset 2 also has the same variables as Dataset 1 and the requirement is that the records and values must match Dataset 1 as its a merely a rewrite from old process. However, I find there are more records in my Dataset 2 and I would like to make a comparison with dataset 1 to see and identity those records that contribute to excess in dataset2. There could some new account_id, or there could be new markets for the some existing account ids. The idea is to exclude the new records and create a separate report.
I tried proc compare, but of no use. I would appreciate a datastep pr proc sql logic. Thanks.
Mark
NOTE: The account_id s are not sorted
I think proc compare requires each set to be sorted but don't take my word on this. I don't know what hardware you are working with but likely, 32 length or so variables could fit in the standard instal of -MEMSIZE 2G to use hash table to save proc sql from achieving a full sort on each set before joining.
So I suggest you try the following:
data old new;
if _n_=1 then do;
declare hash myhash(dataset: 'dataset1');
myhash.definekey('account_id', 'markets', 'sales', 'date');
/* myhash.definedata(); shouldn't need to definedata unless dataset1 isn't fully included in dataset2 and you wish to create a 3rd dataset for those obscure records */
myhash.definedone();
end;
set dataset2;
if myhash.find()=0 then output old;
else output new;
run;
Provided you have enough room in memory to load such a big hash table (so technically if you really only have 2 strings and 2 numeric variables, 32 bytes * 25M records < 1.9G or so that is available to hashing in standard install conditions), this should save you a lot of time from sql joins.
logic is if record from table2 is found in table1 you output old (all methods return code 0 if they are succesful)
otherwise output new
Vincent
I think proc compare requires each set to be sorted but don't take my word on this. I don't know what hardware you are working with but likely, 32 length or so variables could fit in the standard instal of -MEMSIZE 2G to use hash table to save proc sql from achieving a full sort on each set before joining.
So I suggest you try the following:
data old new;
if _n_=1 then do;
declare hash myhash(dataset: 'dataset1');
myhash.definekey('account_id', 'markets', 'sales', 'date');
/* myhash.definedata(); shouldn't need to definedata unless dataset1 isn't fully included in dataset2 and you wish to create a 3rd dataset for those obscure records */
myhash.definedone();
end;
set dataset2;
if myhash.find()=0 then output old;
else output new;
run;
Provided you have enough room in memory to load such a big hash table (so technically if you really only have 2 strings and 2 numeric variables, 32 bytes * 25M records < 1.9G or so that is available to hashing in standard install conditions), this should save you a lot of time from sql joins.
logic is if record from table2 is found in table1 you output old (all methods return code 0 if they are succesful)
otherwise output new
Vincent
Hi Vincent,
Thank you so much indeed for the very quick response in providing me the much needed help. Your solution worked amazingly well and the performance was simply great. And sorry for the delayed reply reply from me as I only got the chance to run your solution couple of hours ago.
Much Appreciate it.
Mark
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.