I have a very large dataset (69 million records) that I need to merge with a relatively small dataset (28k records). I found PROC SQL took too long, so I tried to do the merge with a SAS hash object, using the code below. I wanted to verify that everything was merging OK, so I created a counter that increments every time there is a match. I expected the counter to match the number of observations in the smaller dataset. The dataset LiveDoseViolations has 27,542 observations in it, but when I run the code the counter has the value 28,428. I don't know how that could be the case, nor do I really know how to figure out why. Any ideas?
data LiveVirusValidatedData (drop=LiveDoseViolations) ;
   if _N_ = 1 then do;
      declare hash h(dataset:'LiveVaxViolations') ;
      h.defineKey('cdcid','VaxRecordNum') ;
      h.defineData('DoseValidity');
      h.defineDone();
      length CDCID $30 VaxRecordNum 8 DoseValidity $15 ;
      call missing(DoseValidity) ;
   end;
   set FullyDeDupedData end=lastob ;
   rc = h.find() ;
   if rc = 0 then LiveDoseViolations + 1 ;
   if lastob then call symput('LiveDoseViolations',LiveDoseViolations) ;
run;
The log is below:
4491 +
4492 +data LiveVirusValidatedData (drop=LiveDoseViolations) ;
4493 +
4494 +if _N_ = 1
4495 +then do;
4496 + declare hash h(dataset:'LiveVaxViolations') ;
4497 + h.defineKey('cdcid','VaxRecordNum') ;
4498 + h.defineData('DoseValidity');
4499 + h.defineDone();
4500 +length CDCID $30 VaxRecordNum 8 DoseValidity $15 ;
4501 + call missing(DoseValidity) ;
4502 + end;
4503 +set FullyDeDupedData end=lastob ;
4504 +rc = h.find() ;
4505 +if rc = 0 then LiveDoseViolations + 1 ;
4506 +if lastob then call symput('LiveDoseViolations',LiveDoseViolations) ;
4507 +run;
NOTE: Numeric values have been converted to character values at the places given by:
(Line):(Column).
4506:49
NOTE: There were 27542 observations read from the data set WORK.LIVEVAXVIOLATIONS.
NOTE: There were 69646012 observations read from the data set WORK.FULLYDEDUPEDDATA.
NOTE: The data set WORK.LIVEVIRUSVALIDATEDDATA has 69646012 observations and 86 variables.
NOTE: Compressing data set WORK.LIVEVIRUSVALIDATEDDATA decreased size by 73.23 percent.
Compressed is 932110 pages; un-compressed would require 3482302 pages.
NOTE: DATA statement used (Total process time):
real time 6:41.41
cpu time 6:37.80
4508 +
4509 +%put NOTE: there were &LiveDoseViolations Records found with Live Virus Violations ;
NOTE: there were 28428 Records found with Live Virus Violations
4510 +%put Note: This should match the # of observations in Work.LiveVaxViolations ;
Note: This should match the # of observations in Work.LiveVaxViolations
4511 +
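As an aside, the conversion note at 4506:49 in the log comes from passing the numeric counter to CALL SYMPUT, which expects a character second argument. A minimal alternative, assuming the same variable names as above, is CALL SYMPUTX, which accepts a numeric value and strips leading/trailing blanks from the macro variable:

/* CALL SYMPUTX handles the numeric-to-character conversion itself
   and trims blanks, avoiding the conversion note in the log */
if lastob then call symputx('LiveDoseViolations', LiveDoseViolations) ;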
@spjcdc,
The fact that your Join/Found count (28,428) is higher than your distinct count (27,542) means there are duplicate ('cdcid','VaxRecordNum') key combinations in your large table (WORK.FULLYDEDUPEDDATA) of 69,646,012 observations.
Whether that's acceptable or not, that depends on your business operations and logic.
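One way to see the duplicates directly is to count how often each key pair occurs in the large table. A quick sketch, assuming the variable and dataset names from the code above:

/* List (cdcid, VaxRecordNum) pairs that occur more than once
   in the large table, with their counts */
proc sql;
   create table dupkeys as
   select cdcid, VaxRecordNum, count(*) as n
   from FullyDeDupedData
   group by cdcid, VaxRecordNum
   having count(*) > 1;
quit;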
Hope this clarifies your results,
Ahmed
Sorry, I don't use these hash items, however if you pop in your sql code there could be optimizations to be had on that side. E.g. create an intermediary dataset from the big one with only ids from the small one, then merge that on.
Thanks RW9. I only know enough SQL to be dangerous. Here's what I have.
proc sql ;
   create table LiveVirusValidatedData as
   select F.*, L.DoseValidity
   from FullyDeDupedData F left join LiveVaxViolations L
      on (F.VaxRecordNum = L.VaxRecordNum)
   order by CDCID, VaxDate, VaxGroup ;
quit ;
Steve
Well, at a quick glance (as I am about to leave) see below to create intermediate dataset, then use that to merge. Also see if there are any other restrictions you could put in place, maybe summing data up, or transposing the data.
proc sql;
/* Create an intermediary dataset with only the relevant data, i.e. has a match and only contains columns we need */
create table inter as
select vaxrecordnum,
dosevalidity
from livevaxviolations
where vaxrecordnum in (select distinct vaxrecordnum from fullydedupeddata);
create table livevirusvalidateddata as
select f.*,
l.dosevalidity
from fullydedupeddata f
left join inter l
on f.vaxrecordnum = l.vaxrecordnum
order by cdcid,
vaxdate,
vaxgroup;
quit;
Thanks Ahmed,
There are not SUPPOSED to be duplicate values of CDCID and VaxRecordNum but apparently there are. So that's not good news. At least that gives me something to pursue. Thanks.
Steve
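To track down the unexpected duplicate keys, one option is PROC SORT with NODUPKEY and DUPOUT, which writes the extra records per key to a separate dataset for inspection. A sketch, assuming the key variables named in the original code:

/* Keep the first record per (cdcid, VaxRecordNum) key in DeDupedCheck;
   DUPOUT= captures the duplicate records for review */
proc sort data=FullyDeDupedData out=DeDupedCheck nodupkey dupout=DupKeys;
   by cdcid VaxRecordNum;
run;

Comparing the observation count of DupKeys against the 886-record discrepancy (28,428 minus 27,542) should show whether duplicates in the large table fully explain the counter.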