BookmarkSubscribeRSS Feed
kalbo
Obsidian | Level 7

Hi

 

I need to compare 2 data sets (old and new) using the keys of USUBJID and VISIT using hashing method. (The real data sets I will be comparing are huge, hence the need for hashing).

 

I would like to create output data sets that flag observations as NEW, DELETED or UPDATED.

 

Also if possible I would like to add observations numbers to the output.

 

Thanks

 

data old;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,FEB
R191,VISIT1,MAR
R193,SCREEN,JAN
R193,VISIT1,APR
R194,VISIT4,FEB
;

data new;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,SEP
R191,VISIT1,FEB
R191,VISIT2,APR
R193,SCREEN,JAN
R193,VISIT1,MAY
R194,VISIT4,FEB
R195,VISIT1,SEP
;
9 REPLIES 9
quickbluefish
Barite | Level 11
I could very well be wrong, but I have never understood hashing to be a good solution when the data are huge, as this would mean loading the entire dataset, 'NEW', into memory.
How 'huge' are we talking?
kalbo
Obsidian | Level 7
There would be 10000 rows estimate. I have already been asked to make a hash object. Thanks.
Patrick
Opal | Level 21

This is not really a good use case for using a hash lookup. Is it a must to code this way? And if yes what's the rational?
But first about your sample data: {USUBJID,VISIT} doesn't uniquely identify a row so what's an Update?

 

100K rows is not that much and it could be done using a data step hash approach. It's just not ideal.

If you also need to detect updates (once you clarified the key variables):

- How many variables do you have in your real tables?
- What size do these tables have uncompressed (or run a proc contents and tell us the number of observations and the observation length).

 

If there is a primary key then a data step merge or Proc Compare would likely be a better option. Below some sample code using hashes. Not sure if that really matches your requirements but it's what I could come up with based on your reqs and the sample data provided. I was too lazy to also implement your row number wish also because it wouldn't add much value if you can identify a unique key.

proc format;
	value change_ind
		0 = 'Insert'
		1 = 'Update'
		2 = 'No change'
		. = 'Delete'
	;
run;

data want;
	if _n_=1 then
		do;
			dcl hash h_keymatch(dataset:'old');
			h_keymatch.defineKey('USUBJID','VISIT');
			h_keymatch.defineDone();
			dcl hash h_fullmatch(dataset:'old');
			h_fullmatch.defineKey('USUBJID','VISIT','date');
			h_fullmatch.defineData(all:'y');
			h_fullmatch.defineDone();
		end;
	set new end=_last;
	format change_ind change_ind.;
	change_ind= (h_keymatch.check()=0) + (h_fullmatch.check()=0);
	if change_ind=2 then h_fullmatch.remove();
	if _last then h_fullmatch.output(dataset:'work.deletes');
run;

proc append base=want data=deletes;
run;

proc print data=want;
run;

Patrick_0-1737859371329.png

 

mkeintz
PROC Star

If your data are physically grouped or sorted by USUBJID, then a hash object can be cleared after processing each group, which probably makes the memory demands tolerable.

 

But put aside that for a moment.  If the first dataset had 

USUB   VISIT  DATE

1            screen  jan

1            screen  mar

 

and the second had

USUB  VISIT DATE

1         screen feb

 

does that represent 2 deleted and 1 new,   or 1 deleted and 1 updated.  Remember your proposed keys are USUB/VISIT.  DATE is not treated as a key, but rather as a satellite variable.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

I suspect what they are asking your to do is totally different that the HASH object that you can use in a data step.  Instead they are probably asking you to generate a hashed value from ALL of the variables in an observation so that you can then compare that (shorter) hash value to test if the observations are different instead of having to compare every value.

 

Start by reading this paper.

https://support.sas.com/resources/papers/proceedings20/4838-2020.pdf

 

kalbo
Obsidian | Level 7

Thanks for the reply.

 

USUBJID and VISIT are the keys. Changes in DATE are the ones to be flagged as "updated".

 

I wanted to retain the original sort order of the data sets and do a comparison on them.

 

Number of variables is going to be from 3 to 7, depending on number of keys. There will always be one variable to be compared.

mkeintz
PROC Star

@kalbo wrote:

Thanks for the reply.

 

USUBJID and VISIT are the keys. Changes in DATE are the ones to be flagged as "updated".

 

I wanted to retain the original sort order of the data sets and do a comparison on them.

 

Number of variables is going to be from 3 to 7, depending on number of keys. There will always be one variable to be compared.


OK, then let's say a given USUBJID/VISIT starts with a JAN record and a MAR record, but the new dataset only has a FEB record.  Which of the original records was "updated" and which was "deleted"?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
yabwon
Onyx | Level 15

Do you need something like this:

data old;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,FEB
R191,VISIT1,MAR
R193,SCREEN,JAN
R193,VISIT1,APR
R194,VISIT4,FEB
;

data new;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,SEP
R191,VISIT1,FEB
R191,VISIT2,APR
R193,SCREEN,JAN
R193,VISIT1,MAY
R194,VISIT4,FEB
R195,VISIT1,SEP
;

data _null_;
dcl hash H(ordered:"A");
H.defineKey("USUBJID","VISIT","DATE");
H.defineData("USUBJID","VISIT","DATE","inOLD","inNEW");
H.defineDone();

do until(_E_);
  set 
    OLD(in=in1) 
    NEW(in=in2) 
  end=_E_;

  rc=H.find();
  inOLD+in1;
  inNEW+in2;
  /*put _ALL_;*/ /* just for preview */
  rc=H.replace();
  call missing(inOLD,inNEW);
end;


h.Output(dataset:"compare");
stop;
run;

proc print data=compare;
run;

?

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Kurt_Bremser
Super User

Really no need for hash objects, the standard tools of sorting and merging are sufficient. Particularly with extremely small datasets of just 10000 observations.

proc sort data=old;
by usubjid visit date;
run;

proc sort data=new;
by usubjid visit date;
run;

data compare;
merge
  old (in=o)
  new (in=n)
;
length flag $7;
if o and not n then flag = "DELETED";
else if not o and n then flag = "NEW";
else flag = "UPDATED";
run;

How should "observation numbers" be calculated?

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 1220 views
  • 1 like
  • 7 in conversation