DATA Step, Macro, Functions and more

Hash merge question

Contributor
Posts: 27

I have a very large dataset (69 million records) that I need to merge with a relatively small dataset (28K records).  PROC SQL took too long, so I tried to do the merge with a SAS hash object instead, using the code below.  To verify that everything was merging OK, I added a counter that increments every time there is a match; I expected the counter to equal the number of observations in the smaller dataset.  The dataset LiveVaxViolations has 27,542 observations, but when I run the code the counter ends up at 28,428.  I don't understand how that can happen, nor do I really know how to track down why.  Any ideas?

data LiveVirusValidatedData (drop=LiveDoseViolations) ;
if _N_ = 1
then do;
     declare hash h(dataset:'LiveVaxViolations') ;
     h.defineKey('cdcid','VaxRecordNum') ;
     h.defineData('DoseValidity');
     h.defineDone();
     length CDCID $30 VaxRecordNum 8 DoseValidity $15 ;
     call missing(DoseValidity) ;
     end;
set FullyDeDupedData end=lastob ;
rc = h.find() ;
if rc = 0 then LiveDoseViolations + 1 ;
if lastob then call symput('LiveDoseViolations',LiveDoseViolations) ;
run;

The log is below:

4491 +
4492 +data LiveVirusValidatedData (drop=LiveDoseViolations) ;
4493 +
4494 +if _N_ = 1
4495 +then do;
4496 +     declare hash h(dataset:'LiveVaxViolations') ;
4497 +     h.defineKey('cdcid','VaxRecordNum') ;
4498 +     h.defineData('DoseValidity');
4499 +     h.defineDone();
4500 +length CDCID $30 VaxRecordNum 8 DoseValidity $15 ;
4501 +     call missing(DoseValidity) ;
4502 +     end;
4503 +set FullyDeDupedData end=lastob ;
4504 +rc = h.find() ;
4505 +if rc = 0 then LiveDoseViolations + 1 ;
4506 +if lastob then call symput('LiveDoseViolations',LiveDoseViolations) ;
4507 +run;

NOTE: Numeric values have been converted to character values at the places given by:
      (Line):(Column).
      4506:49
NOTE: There were 27542 observations read from the data set WORK.LIVEVAXVIOLATIONS.
NOTE: There were 69646012 observations read from the data set WORK.FULLYDEDUPEDDATA.
NOTE: The data set WORK.LIVEVIRUSVALIDATEDDATA has 69646012 observations and 86 variables.
NOTE: Compressing data set WORK.LIVEVIRUSVALIDATEDDATA decreased size by 73.23 percent.
      Compressed is 932110 pages; un-compressed would require 3482302 pages.
NOTE: DATA statement used (Total process time):
      real time           6:41.41
      cpu time            6:37.80

4508 +
4509 +%put NOTE: there were &LiveDoseViolations Records found with Live Virus Violations ;
NOTE: there were        28428 Records found with Live Virus Violations
4510 +%put Note: This should match the # of observations in Work.LiveVaxViolations ;
Note: This should match the # of observations in Work.LiveVaxViolations
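(A side note on the conversion message in the log: the NOTE at 4506:49 comes from passing the numeric counter straight to CALL SYMPUT, which implicitly converts it to character with leading blanks; that is why the %put output shows the padded value.  It doesn't affect the count itself, but if you want the macro variable without the padding, CALL SYMPUTX does the trimmed conversion.  A minimal sketch of the changed line:)

```sas
/* CALL SYMPUTX accepts a numeric value directly and strips
   leading/trailing blanks from the stored macro value,
   avoiding the numeric-to-character conversion NOTE. */
if lastob then call symputx('LiveDoseViolations', LiveDoseViolations);
```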


Accepted Solutions
Solution
04-02-2014 11:04 AM
Regular Contributor
Posts: 216

Re: Hash merge question

@spjcdc,

The fact that your join/found count (28,428) is higher than your distinct count (27,542) means there are duplicate ('cdcid','VaxRecordNum') key combinations in your large table (WORK.FULLYDEDUPEDDATA) of 69,646,012 observations.

Whether that's acceptable or not depends on your business operations and logic.

Hope this clarifies your results,

Ahmed
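To confirm this, one option is to count rows per key in the large table and keep only the keys that occur more than once.  The dataset and variable names below are taken from this thread, so adjust as needed, and note that a GROUP BY over 69 million rows will take a while:

```sas
/* List each ('cdcid','VaxRecordNum') combination that appears more
   than once in the large table; every extra occurrence of a matching
   key bumps the hash-lookup counter by one. */
proc sql;
  create table dup_keys as
  select cdcid, VaxRecordNum, count(*) as n_rows
  from   FullyDeDupedData
  group by cdcid, VaxRecordNum
  having count(*) > 1;
quit;
```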



All Replies
Super User
Posts: 7,951

Re: Hash merge question

Sorry, I don't use the hash object myself; however, if you post your SQL code there may be optimizations to be had on that side.  E.g. create an intermediary dataset from the big one containing only the IDs from the small one, then merge that on.

Contributor
Posts: 27

Re: Hash merge question

Thanks RW9.  I only know enough SQL to be dangerous.  Here's what I have.

Proc SQL ;
Create Table LiveVirusValidatedData as
Select F.*, L.DoseValidity
     from FullyDeDupedData F left join LiveVaxViolations L
     on (F.VaxRecordNum = L.VaxRecordNum)
     order by CDCID, VaxDate, VaxGroup ;
quit ;

Steve

Super User
Posts: 7,951

Re: Hash merge question

Well, at a quick glance (as I am about to leave), see below: it creates an intermediate dataset, then uses that for the merge.  Also see if there are any other restrictions you could put in place, e.g. summing the data up or transposing it.

proc sql;

  /* Create an intermediary dataset with only the relevant data, i.e. has a match and only contains columns we need */
  create table inter as
  select  vaxrecordnum,
          dosevalidity
  from    livevaxviolations
  where   vaxrecordnum in (select distinct vaxrecordnum from fullydedupeddata);

  create table livevirusvalidateddata as
  select  f.*,
          l.dosevalidity
  from    fullydedupeddata f
  left join inter l
  on      f.vaxrecordnum = l.vaxrecordnum
  order by  cdcid,
            vaxdate,
            vaxgroup;    
quit;


Contributor
Posts: 27

Re: Hash merge question

Posted in reply to AhmedAl_Attar

Thanks Ahmed,

There are not SUPPOSED to be duplicate values of CDCID and VaxRecordNum but apparently there are.  So that's not good news.  At least that gives me something to pursue.  Thanks. 

Steve
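For the follow-up investigation, PROC SORT with NODUPKEY and DUPOUT= can both deduplicate and capture the offending rows in one pass; the dataset names here are taken from the thread, so treat this as a sketch rather than drop-in code:

```sas
/* Keep the first row per key in DEDUPED and write every extra
   occurrence to DUP_ROWS so the duplicates can be inspected. */
proc sort data=FullyDeDupedData out=Deduped dupout=Dup_Rows nodupkey;
  by cdcid VaxRecordNum;
run;
```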
