Comparing 2 data sets using hashing

kalbo · Posted 01-25-2025 10:06 AM

Hi

I need to compare 2 data sets (old and new) using the keys of USUBJID and VISIT using hashing method. (The real data sets I will be comparing are huge, hence the need for hashing).

I would like to create output data sets that flag observations as NEW, DELETED or UPDATED.

Also if possible I would like to add observations numbers to the output.

Thanks

data old;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,FEB
R191,VISIT1,MAR
R193,SCREEN,JAN
R193,VISIT1,APR
R194,VISIT4,FEB
;

data new;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,SEP
R191,VISIT1,FEB
R191,VISIT2,APR
R193,SCREEN,JAN
R193,VISIT1,MAY
R194,VISIT4,FEB
R195,VISIT1,SEP
;

quickbluefish · Posted 01-25-2025 10:45 AM

I could very well be wrong, but I have never understood hashing to be a good solution when the data are huge, as this would mean loading the entire dataset, 'NEW', into memory.
How 'huge' are we talking?

kalbo · Posted 01-25-2025 01:49 PM

There would be 10000 rows estimate. I have already been asked to make a hash object. Thanks.

Patrick · Posted 01-25-2025 09:27 PM

This is not really a good use case for using a hash lookup. Is it a must to code this way? And if yes what's the rational?
But first about your sample data: {USUBJID,VISIT} doesn't uniquely identify a row so what's an Update?

100K rows is not that much and it could be done using a data step hash approach. It's just not ideal.

If you also need to detect updates (once you clarified the key variables):

- How many variables do you have in your real tables?
- What size do these tables have uncompressed (or run a proc contents and tell us the number of observations and the observation length).

If there is a primary key then a data step merge or Proc Compare would likely be a better option. Below some sample code using hashes. Not sure if that really matches your requirements but it's what I could come up with based on your reqs and the sample data provided. I was too lazy to also implement your row number wish also because it wouldn't add much value if you can identify a unique key.

proc format;
	value change_ind
		0 = 'Insert'
		1 = 'Update'
		2 = 'No change'
		. = 'Delete'
	;
run;

data want;
	if _n_=1 then
		do;
			dcl hash h_keymatch(dataset:'old');
			h_keymatch.defineKey('USUBJID','VISIT');
			h_keymatch.defineDone();
			dcl hash h_fullmatch(dataset:'old');
			h_fullmatch.defineKey('USUBJID','VISIT','date');
			h_fullmatch.defineData(all:'y');
			h_fullmatch.defineDone();
		end;
	set new end=_last;
	format change_ind change_ind.;
	change_ind= (h_keymatch.check()=0) + (h_fullmatch.check()=0);
	if change_ind=2 then h_fullmatch.remove();
	if _last then h_fullmatch.output(dataset:'work.deletes');
run;

proc append base=want data=deletes;
run;

proc print data=want;
run;

mkeintz · Posted 01-25-2025 09:59 PM

If your data are physically grouped or sorted by USUBJID, then a hash object can be cleared after processing each group, which probably makes the memory demands tolerable.

But put aside that for a moment. If the first dataset had

USUB VISIT DATE

1 screen jan

1 screen mar

and the second had

USUB VISIT DATE

1 screen feb

does that represent 2 deleted and 1 new, or 1 deleted and 1 updated. Remember your proposed keys are USUB/VISIT. DATE is not treated as a key, but rather as a satellite variable.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 01-26-2025 12:03 AM

I suspect what they are asking your to do is totally different that the HASH object that you can use in a data step. Instead they are probably asking you to generate a hashed value from ALL of the variables in an observation so that you can then compare that (shorter) hash value to test if the observations are different instead of having to compare every value.

Start by reading this paper.

https://support.sas.com/resources/papers/proceedings20/4838-2020.pdf

kalbo · Posted 01-26-2025 06:25 AM

Thanks for the reply.

USUBJID and VISIT are the keys. Changes in DATE are the ones to be flagged as "updated".

I wanted to retain the original sort order of the data sets and do a comparison on them.

Number of variables is going to be from 3 to 7, depending on number of keys. There will always be one variable to be compared.

mkeintz · Posted 01-26-2025 06:47 PM

@kalbo wrote:

Thanks for the reply.

USUBJID and VISIT are the keys. Changes in DATE are the ones to be flagged as "updated".

I wanted to retain the original sort order of the data sets and do a comparison on them.

Number of variables is going to be from 3 to 7, depending on number of keys. There will always be one variable to be compared.

OK, then let's say a given USUBJID/VISIT starts with a JAN record and a MAR record, but the new dataset only has a FEB record. Which of the original records was "updated" and which was "deleted"?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

yabwon · Posted 01-26-2025 08:45 AM

Do you need something like this:

data old;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,FEB
R191,VISIT1,MAR
R193,SCREEN,JAN
R193,VISIT1,APR
R194,VISIT4,FEB
;

data new;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,SEP
R191,VISIT1,FEB
R191,VISIT2,APR
R193,SCREEN,JAN
R193,VISIT1,MAY
R194,VISIT4,FEB
R195,VISIT1,SEP
;

data _null_;
dcl hash H(ordered:"A");
H.defineKey("USUBJID","VISIT","DATE");
H.defineData("USUBJID","VISIT","DATE","inOLD","inNEW");
H.defineDone();

do until(_E_);
  set 
    OLD(in=in1) 
    NEW(in=in2) 
  end=_E_;

  rc=H.find();
  inOLD+in1;
  inNEW+in2;
  /*put _ALL_;*/ /* just for preview */
  rc=H.replace();
  call missing(inOLD,inNEW);
end;


h.Output(dataset:"compare");
stop;
run;

proc print data=compare;
run;

?

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Kurt_Bremser · Posted 01-26-2025 03:19 PM

Really no need for hash objects, the standard tools of sorting and merging are sufficient. Particularly with extremely small datasets of just 10000 observations.

proc sort data=old;
by usubjid visit date;
run;

proc sort data=new;
by usubjid visit date;
run;

data compare;
merge
  old (in=o)
  new (in=n)
;
length flag $7;
if o and not n then flag = "DELETED";
else if not o and n then flag = "NEW";
else flag = "UPDATED";
run;

How should "observation numbers" be calculated?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Comparing 2 data sets using hashing

Re: Comparing 2 data sets using hashing

Re: Comparing 2 data sets using hashing

Re: Comparing 2 data sets using hashing

Re: Comparing 2 data sets using hashing

Re: Comparing 2 data sets using hashing

Re: Comparing 2 data sets using hashing

Re: Comparing 2 data sets using hashing

Re: Comparing 2 data sets using hashing

Re: Comparing 2 data sets using hashing

Registration is open

SAS Training: Just a Click Away