topic Re: Comparing 2 data sets using hashing in SAS Programming

Comparing 2 data sets using hashing

kalbo — Sat, 25 Jan 2025 15:06:18 GMT

I need to compare 2 data sets (old and new) using the keys of USUBJID and VISIT using hashing method. (The real data sets I will be comparing are huge, hence the need for hashing).

I would like to create output data sets that flag observations as NEW, DELETED or UPDATED.

Also if possible I would like to add observations numbers to the output.

Thanks

data old;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,FEB
R191,VISIT1,MAR
R193,SCREEN,JAN
R193,VISIT1,APR
R194,VISIT4,FEB
;

data new;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,SEP
R191,VISIT1,FEB
R191,VISIT2,APR
R193,SCREEN,JAN
R193,VISIT1,MAY
R194,VISIT4,FEB
R195,VISIT1,SEP
;

Re: Comparing 2 data sets using hashing

quickbluefish — Sat, 25 Jan 2025 15:45:31 GMT

I could very well be wrong, but I have never understood hashing to be a good solution when the data are huge, as this would mean loading the entire dataset, 'NEW', into memory.
How 'huge' are we talking?

Re: Comparing 2 data sets using hashing

kalbo — Sat, 25 Jan 2025 18:49:11 GMT

There would be 10000 rows estimate. I have already been asked to make a hash object. Thanks.

Re: Comparing 2 data sets using hashing

Patrick — Sun, 26 Jan 2025 02:44:38 GMT

This is not really a good use case for using a hash lookup. Is it a must to code this way? And if yes what's the rational?
But first about your sample data: {USUBJID,VISIT} doesn't uniquely identify a row so what's an Update?

100K rows is not that much and it could be done using a data step hash approach. It's just not ideal.

If you also need to detect updates (once you clarified the key variables):

- How many variables do you have in your real tables?
- What size do these tables have uncompressed (or run a proc contents and tell us the number of observations and the observation length).

If there is a primary key then a data step merge or Proc Compare would likely be a better option. Below some sample code using hashes. Not sure if that really matches your requirements but it's what I could come up with based on your reqs and the sample data provided. I was too lazy to also implement your row number wish also because it wouldn't add much value if you can identify a unique key.

proc format;
	value change_ind
		0 = 'Insert'
		1 = 'Update'
		2 = 'No change'
		. = 'Delete'
	;
run;

data want;
	if _n_=1 then
		do;
			dcl hash h_keymatch(dataset:'old');
			h_keymatch.defineKey('USUBJID','VISIT');
			h_keymatch.defineDone();
			dcl hash h_fullmatch(dataset:'old');
			h_fullmatch.defineKey('USUBJID','VISIT','date');
			h_fullmatch.defineData(all:'y');
			h_fullmatch.defineDone();
		end;
	set new end=_last;
	format change_ind change_ind.;
	change_ind= (h_keymatch.check()=0) + (h_fullmatch.check()=0);
	if change_ind=2 then h_fullmatch.remove();
	if _last then h_fullmatch.output(dataset:'work.deletes');
run;

proc append base=want data=deletes;
run;

proc print data=want;
run;

Re: Comparing 2 data sets using hashing

mkeintz — Sun, 26 Jan 2025 03:40:40 GMT

If your data are physically grouped or sorted by USUBJID, then a hash object can be cleared after processing each group, which probably makes the memory demands tolerable.

But put aside that for a moment. If the first dataset had

USUB VISIT DATE

1 screen jan

1 screen mar

and the second had

USUB VISIT DATE

1 screen feb

does that represent 2 deleted and 1 new, or 1 deleted and 1 updated. Remember your proposed keys are USUB/VISIT. DATE is not treated as a key, but rather as a satellite variable.

Re: Comparing 2 data sets using hashing

Tom — Sun, 26 Jan 2025 05:03:30 GMT

I suspect what they are asking your to do is totally different that the HASH object that you can use in a data step. Instead they are probably asking you to generate a hashed value from ALL of the variables in an observation so that you can then compare that (shorter) hash value to test if the observations are different instead of having to compare every value.

Start by reading this paper.

https://support.sas.com/resources/papers/proceedings20/4838-2020.pdf

Re: Comparing 2 data sets using hashing

kalbo — Sun, 26 Jan 2025 11:25:41 GMT

Thanks for the reply.

USUBJID and VISIT are the keys. Changes in DATE are the ones to be flagged as "updated".

I wanted to retain the original sort order of the data sets and do a comparison on them.

Number of variables is going to be from 3 to 7, depending on number of keys. There will always be one variable to be compared.

Re: Comparing 2 data sets using hashing

yabwon — Sun, 26 Jan 2025 13:45:44 GMT

Do you need something like this:

data old;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,FEB
R191,VISIT1,MAR
R193,SCREEN,JAN
R193,VISIT1,APR
R194,VISIT4,FEB
;

data new;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,SEP
R191,VISIT1,FEB
R191,VISIT2,APR
R193,SCREEN,JAN
R193,VISIT1,MAY
R194,VISIT4,FEB
R195,VISIT1,SEP
;

data _null_;
dcl hash H(ordered:"A");
H.defineKey("USUBJID","VISIT","DATE");
H.defineData("USUBJID","VISIT","DATE","inOLD","inNEW");
H.defineDone();

do until(_E_);
  set 
    OLD(in=in1) 
    NEW(in=in2) 
  end=_E_;

  rc=H.find();
  inOLD+in1;
  inNEW+in2;
  /*put _ALL_;*/ /* just for preview */
  rc=H.replace();
  call missing(inOLD,inNEW);
end;


h.Output(dataset:"compare");
stop;
run;

proc print data=compare;
run;

Re: Comparing 2 data sets using hashing

Kurt_Bremser — Sun, 26 Jan 2025 20:19:25 GMT

Really no need for hash objects, the standard tools of sorting and merging are sufficient. Particularly with extremely small datasets of just 10000 observations.

proc sort data=old;
by usubjid visit date;
run;

proc sort data=new;
by usubjid visit date;
run;

data compare;
merge
  old (in=o)
  new (in=n)
;
length flag $7;
if o and not n then flag = "DELETED";
else if not o and n then flag = "NEW";
else flag = "UPDATED";
run;

How should "observation numbers" be calculated?

Re: Comparing 2 data sets using hashing

mkeintz — Sun, 26 Jan 2025 23:47:28 GMT

@kalbo wrote:

Thanks for the reply.

USUBJID and VISIT are the keys. Changes in DATE are the ones to be flagged as "updated".

I wanted to retain the original sort order of the data sets and do a comparison on them.

Number of variables is going to be from 3 to 7, depending on number of keys. There will always be one variable to be compared.

OK, then let's say a given USUBJID/VISIT starts with a JAN record and a MAR record, but the new dataset only has a FEB record. Which of the original records was "updated" and which was "deleted"?