<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Comparing 2 data sets using hashing in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957242#M373700</link>
    <description>&lt;P&gt;I suspect what they are asking your to do is totally different that the HASH object that you can use in a data step.&amp;nbsp; Instead they are probably asking you to generate a hashed value from ALL of the variables in an observation so that you can then compare that (shorter) hash value to test if the observations are different instead of having to compare every value.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Start by reading this paper.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://support.sas.com/resources/papers/proceedings20/4838-2020.pdf" target="_blank"&gt;https://support.sas.com/resources/papers/proceedings20/4838-2020.pdf&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Sun, 26 Jan 2025 05:03:30 GMT</pubDate>
    <dc:creator>Tom</dc:creator>
    <dc:date>2025-01-26T05:03:30Z</dc:date>
    <item>
      <title>Comparing 2 data sets using hashing</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957226#M373693</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I need to compare 2 data sets (old and new) using the keys of USUBJID and VISIT using hashing method. (The real data sets I will be comparing are huge, hence the need for hashing).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I would like to create output data sets that flag observations as NEW, DELETED or UPDATED.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also if possible I would like to add observations numbers to the output.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data old;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,FEB
R191,VISIT1,MAR
R193,SCREEN,JAN
R193,VISIT1,APR
R194,VISIT4,FEB
;

data new;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,SEP
R191,VISIT1,FEB
R191,VISIT2,APR
R193,SCREEN,JAN
R193,VISIT1,MAY
R194,VISIT4,FEB
R195,VISIT1,SEP
;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sat, 25 Jan 2025 15:06:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957226#M373693</guid>
      <dc:creator>kalbo</dc:creator>
      <dc:date>2025-01-25T15:06:18Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing 2 data sets using hashing</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957227#M373694</link>
      <description>I could very well be wrong, but I have never understood hashing to be a good solution when the data are huge, as this would mean loading the entire dataset, 'NEW', into memory.  &lt;BR /&gt;How 'huge' are we talking?</description>
      <pubDate>Sat, 25 Jan 2025 15:45:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957227#M373694</guid>
      <dc:creator>quickbluefish</dc:creator>
      <dc:date>2025-01-25T15:45:31Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing 2 data sets using hashing</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957232#M373696</link>
      <description>There would be 10000 rows estimate. I have already been asked to make a hash object. Thanks.</description>
      <pubDate>Sat, 25 Jan 2025 18:49:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957232#M373696</guid>
      <dc:creator>kalbo</dc:creator>
      <dc:date>2025-01-25T18:49:11Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing 2 data sets using hashing</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957237#M373698</link>
      <description>&lt;P&gt;This is not really a good use case for using a hash lookup. Is it a must to code this way? And if yes what's the rational?&lt;BR /&gt;But first about your sample data: {USUBJID,VISIT} doesn't uniquely identify a row so what's an Update?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;100K rows is not that much and it could be done using a data step hash approach. It's just not ideal.&lt;/P&gt;
&lt;P&gt;If you also need to detect updates (once you clarified the key variables):&lt;/P&gt;
&lt;P&gt;- How many variables do you have in your real tables? &lt;BR /&gt;- What size do these tables have uncompressed (or run a proc contents and tell us the number of observations and the observation length).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If there is a primary key then a data step merge or Proc Compare would likely be a better option. Below some sample code using hashes. Not sure if that really matches your requirements but it's what I could come up with based on your reqs and the sample data provided. I was too lazy to also implement your row number wish also because it wouldn't add much value if you can identify a unique key.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc format;
	value change_ind
		0 = 'Insert'
		1 = 'Update'
		2 = 'No change'
		. = 'Delete'
	;
run;

data want;
	if _n_=1 then
		do;
			dcl hash h_keymatch(dataset:'old');
			h_keymatch.defineKey('USUBJID','VISIT');
			h_keymatch.defineDone();
			dcl hash h_fullmatch(dataset:'old');
			h_fullmatch.defineKey('USUBJID','VISIT','date');
			h_fullmatch.defineData(all:'y');
			h_fullmatch.defineDone();
		end;
	set new end=_last;
	format change_ind change_ind.;
	change_ind= (h_keymatch.check()=0) + (h_fullmatch.check()=0);
	if change_ind=2 then h_fullmatch.remove();
	if _last then h_fullmatch.output(dataset:'work.deletes');
run;

proc append base=want data=deletes;
run;

proc print data=want;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Patrick_0-1737859371329.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/104002i5CC497158DF4991A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Patrick_0-1737859371329.png" alt="Patrick_0-1737859371329.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 26 Jan 2025 02:44:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957237#M373698</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2025-01-26T02:44:38Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing 2 data sets using hashing</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957239#M373699</link>
      <description>&lt;P&gt;If your data are physically grouped or sorted by USUBJID, then a hash object can be cleared after processing each group, which probably makes the memory demands tolerable.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But put aside that for a moment.&amp;nbsp; If the first dataset had&amp;nbsp;&lt;/P&gt;
&lt;P&gt;USUB&amp;nbsp; &amp;nbsp;VISIT&amp;nbsp; DATE&lt;/P&gt;
&lt;P&gt;1&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; screen&amp;nbsp; jan&lt;/P&gt;
&lt;P&gt;1&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; screen&amp;nbsp; mar&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;and the second had&lt;/P&gt;
&lt;P&gt;USUB&amp;nbsp; VISIT DATE&lt;/P&gt;
&lt;P&gt;1&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;screen feb&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;does that represent 2 deleted and 1 new,&amp;nbsp; &amp;nbsp;or 1 deleted and 1 updated.&amp;nbsp; Remember your proposed keys are USUB/VISIT.&amp;nbsp; DATE is not treated as a key, but rather as a satellite variable.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 26 Jan 2025 03:40:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957239#M373699</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2025-01-26T03:40:40Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing 2 data sets using hashing</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957242#M373700</link>
      <description>&lt;P&gt;I suspect what they are asking your to do is totally different that the HASH object that you can use in a data step.&amp;nbsp; Instead they are probably asking you to generate a hashed value from ALL of the variables in an observation so that you can then compare that (shorter) hash value to test if the observations are different instead of having to compare every value.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Start by reading this paper.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://support.sas.com/resources/papers/proceedings20/4838-2020.pdf" target="_blank"&gt;https://support.sas.com/resources/papers/proceedings20/4838-2020.pdf&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 26 Jan 2025 05:03:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957242#M373700</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2025-01-26T05:03:30Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing 2 data sets using hashing</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957246#M373703</link>
      <description>&lt;P&gt;Thanks for the reply.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;USUBJID and VISIT are the keys. Changes in DATE are the ones to be flagged as "updated".&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I wanted to retain the original sort order of the data sets and do a comparison on them.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Number of variables is going to be from 3 to 7, depending on number of keys. There will always be one variable to be compared.&lt;/P&gt;</description>
      <pubDate>Sun, 26 Jan 2025 11:25:41 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957246#M373703</guid>
      <dc:creator>kalbo</dc:creator>
      <dc:date>2025-01-26T11:25:41Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing 2 data sets using hashing</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957247#M373704</link>
      <description>&lt;P&gt;Do you need something like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data old;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,FEB
R191,VISIT1,MAR
R193,SCREEN,JAN
R193,VISIT1,APR
R194,VISIT4,FEB
;

data new;
infile datalines dsd;
input USUBJID $ VISIT: $8. DATE: $8.;
datalines; 
R191,SCREEN,JAN
R191,SCREEN,SEP
R191,VISIT1,FEB
R191,VISIT2,APR
R193,SCREEN,JAN
R193,VISIT1,MAY
R194,VISIT4,FEB
R195,VISIT1,SEP
;

data _null_;
dcl hash H(ordered:"A");
H.defineKey("USUBJID","VISIT","DATE");
H.defineData("USUBJID","VISIT","DATE","inOLD","inNEW");
H.defineDone();

do until(_E_);
  set 
    OLD(in=in1) 
    NEW(in=in2) 
  end=_E_;

  rc=H.find();
  inOLD+in1;
  inNEW+in2;
  /*put _ALL_;*/ /* just for preview */
  rc=H.replace();
  call missing(inOLD,inNEW);
end;


h.Output(dataset:"compare");
stop;
run;

proc print data=compare;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;?&lt;/P&gt;</description>
      <pubDate>Sun, 26 Jan 2025 13:45:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957247#M373704</guid>
      <dc:creator>yabwon</dc:creator>
      <dc:date>2025-01-26T13:45:44Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing 2 data sets using hashing</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957249#M373705</link>
      <description>&lt;P&gt;Really no need for hash objects, the standard tools of sorting and merging are sufficient. Particularly with extremely small datasets of just 10000 observations.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=old;
by usubjid visit date;
run;

proc sort data=new;
by usubjid visit date;
run;

data compare;
merge
  old (in=o)
  new (in=n)
;
length flag $7;
if o and not n then flag = "DELETED";
else if not o and n then flag = "NEW";
else flag = "UPDATED";
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;How should "observation numbers" be calculated?&lt;/P&gt;</description>
      <pubDate>Sun, 26 Jan 2025 20:19:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957249#M373705</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2025-01-26T20:19:25Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing 2 data sets using hashing</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957264#M373710</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/160861"&gt;@kalbo&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Thanks for the reply.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;USUBJID and VISIT are the keys. Changes in DATE are the ones to be flagged as "updated".&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I wanted to retain the original sort order of the data sets and do a comparison on them.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Number of variables is going to be from 3 to 7, depending on number of keys. There will always be one variable to be compared.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;OK, then let's say a given USUBJID/VISIT starts with a JAN record and a MAR record, but the new dataset only has a FEB record.&amp;nbsp; Which of the original records was "updated" and which was "deleted"?&lt;/P&gt;</description>
      <pubDate>Sun, 26 Jan 2025 23:47:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Comparing-2-data-sets-using-hashing/m-p/957264#M373710</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2025-01-26T23:47:28Z</dc:date>
    </item>
  </channel>
</rss>

