Hi
I need to compare 2 data sets (old and new) using the keys of USUBJID and VISIT using hashing method. (The real data sets I will be comparing are huge, hence the need for hashing).
I would like to create output data sets that flag observations as NEW, DELETED or UPDATED.
Also if possible I would like to add observations numbers to the output.
Thanks
data old;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines;
R191,SCREEN,JAN
R191,SCREEN,FEB
R191,VISIT1,MAR
R193,SCREEN,JAN
R193,VISIT1,APR
R194,VISIT4,FEB
;
data new;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines;
R191,SCREEN,JAN
R191,SCREEN,SEP
R191,VISIT1,FEB
R191,VISIT2,APR
R193,SCREEN,JAN
R193,VISIT1,MAY
R194,VISIT4,FEB
R195,VISIT1,SEP
;
This is not really a good use case for using a hash lookup. Is it a must to code this way? And if yes what's the rational?
But first about your sample data: {USUBJID,VISIT} doesn't uniquely identify a row so what's an Update?
100K rows is not that much and it could be done using a data step hash approach. It's just not ideal.
If you also need to detect updates (once you clarified the key variables):
- How many variables do you have in your real tables?
- What size do these tables have uncompressed (or run a proc contents and tell us the number of observations and the observation length).
If there is a primary key then a data step merge or Proc Compare would likely be a better option. Below some sample code using hashes. Not sure if that really matches your requirements but it's what I could come up with based on your reqs and the sample data provided. I was too lazy to also implement your row number wish also because it wouldn't add much value if you can identify a unique key.
proc format;
value change_ind
0 = 'Insert'
1 = 'Update'
2 = 'No change'
. = 'Delete'
;
run;
data want;
if _n_=1 then
do;
dcl hash h_keymatch(dataset:'old');
h_keymatch.defineKey('USUBJID','VISIT');
h_keymatch.defineDone();
dcl hash h_fullmatch(dataset:'old');
h_fullmatch.defineKey('USUBJID','VISIT','date');
h_fullmatch.defineData(all:'y');
h_fullmatch.defineDone();
end;
set new end=_last;
format change_ind change_ind.;
change_ind= (h_keymatch.check()=0) + (h_fullmatch.check()=0);
if change_ind=2 then h_fullmatch.remove();
if _last then h_fullmatch.output(dataset:'work.deletes');
run;
proc append base=want data=deletes;
run;
proc print data=want;
run;
If your data are physically grouped or sorted by USUBJID, then a hash object can be cleared after processing each group, which probably makes the memory demands tolerable.
But put aside that for a moment. If the first dataset had
USUB VISIT DATE
1 screen jan
1 screen mar
and the second had
USUB VISIT DATE
1 screen feb
does that represent 2 deleted and 1 new, or 1 deleted and 1 updated. Remember your proposed keys are USUB/VISIT. DATE is not treated as a key, but rather as a satellite variable.
I suspect what they are asking your to do is totally different that the HASH object that you can use in a data step. Instead they are probably asking you to generate a hashed value from ALL of the variables in an observation so that you can then compare that (shorter) hash value to test if the observations are different instead of having to compare every value.
Start by reading this paper.
https://support.sas.com/resources/papers/proceedings20/4838-2020.pdf
Thanks for the reply.
USUBJID and VISIT are the keys. Changes in DATE are the ones to be flagged as "updated".
I wanted to retain the original sort order of the data sets and do a comparison on them.
Number of variables is going to be from 3 to 7, depending on number of keys. There will always be one variable to be compared.
@kalbo wrote:
Thanks for the reply.
USUBJID and VISIT are the keys. Changes in DATE are the ones to be flagged as "updated".
I wanted to retain the original sort order of the data sets and do a comparison on them.
Number of variables is going to be from 3 to 7, depending on number of keys. There will always be one variable to be compared.
OK, then let's say a given USUBJID/VISIT starts with a JAN record and a MAR record, but the new dataset only has a FEB record. Which of the original records was "updated" and which was "deleted"?
Do you need something like this:
data old;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines;
R191,SCREEN,JAN
R191,SCREEN,FEB
R191,VISIT1,MAR
R193,SCREEN,JAN
R193,VISIT1,APR
R194,VISIT4,FEB
;
data new;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines;
R191,SCREEN,JAN
R191,SCREEN,SEP
R191,VISIT1,FEB
R191,VISIT2,APR
R193,SCREEN,JAN
R193,VISIT1,MAY
R194,VISIT4,FEB
R195,VISIT1,SEP
;
data _null_;
dcl hash H(ordered:"A");
H.defineKey("USUBJID","VISIT","DATE");
H.defineData("USUBJID","VISIT","DATE","inOLD","inNEW");
H.defineDone();
do until(_E_);
set
OLD(in=in1)
NEW(in=in2)
end=_E_;
rc=H.find();
inOLD+in1;
inNEW+in2;
/*put _ALL_;*/ /* just for preview */
rc=H.replace();
call missing(inOLD,inNEW);
end;
h.Output(dataset:"compare");
stop;
run;
proc print data=compare;
run;
?
Really no need for hash objects, the standard tools of sorting and merging are sufficient. Particularly with extremely small datasets of just 10000 observations.
proc sort data=old;
by usubjid visit date;
run;
proc sort data=new;
by usubjid visit date;
run;
data compare;
merge
old (in=o)
new (in=n)
;
length flag $7;
if o and not n then flag = "DELETED";
else if not o and n then flag = "NEW";
else flag = "UPDATED";
run;
How should "observation numbers" be calculated?
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.