BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
spjcdc
Calcite | Level 5

I have a very large dataset (69 million rec) that I need to merge with a relatively small dataset (28k recs).  I found proc sql took too long so I tried to create a merge using a SAS hash object.  I wrote the following code to do that.  I wanted to verify that everything was merging OK so I created a counter to increment every time there was a match.  I expected the counter to match the number of observations in the smaller dataset.  The dataset LiveDoseViolations has 27,542 observations in it but when i run the code the counter has the value 28,428.  I don't know how that could be the case nor do i really know how to figure out why.  Any ideas?

data LiveVirusValidatedData (drop=LiveDoseViolations) ;

if _N_ = 1

then do;

     declare hash h(dataset:'LiveVaxViolations') ;

     h.defineKey('cdcid','VaxRecordNum') ;

     h.defineData('DoseValidity');

     h.defineDone();

     length CDCID $30 VaxRecordNum 8 DoseValidity $15 ;

     call missing(DoseValidity) ;

     end;

set FullyDeDupedData end=lastob ;

rc = h.find() ;

if rc = 0 then LiveDoseViolations + 1 ;

if lastob then call symput('LiveDoseViolations',LiveDoseViolations) ;

run;

The log is below:

4491 +

4492 +data LiveVirusValidatedData (drop=LiveDoseViolations) ;

4493 +

4494 +if _N_ = 1

4495 +then do;

4496 +     declare hash h(dataset:'LiveVaxViolations') ;

4497 +     h.defineKey('cdcid','VaxRecordNum') ;

4498 +     h.defineData('DoseValidity');

4499 +     h.defineDone();

4500 +length CDCID $30 VaxRecordNum 8 DoseValidity $15 ;

4501 +     call missing(DoseValidity) ;

4502 +     end;

4503 +set FullyDeDupedData end=lastob ;

4504 +rc = h.find() ;

4505 +if rc = 0 then LiveDoseViolations + 1 ;

4506 +if lastob then call symput('LiveDoseViolations',LiveDoseViolations) ;

4507 +run;

NOTE: Numeric values have been converted to character values at the places given by:

      (Line):(Column).

      4506:49  

NOTE: There were 27542 observations read from the data set WORK.LIVEVAXVIOLATIONS.

NOTE: There were 69646012 observations read from the data set WORK.FULLYDEDUPEDDATA.

NOTE: The data set WORK.LIVEVIRUSVhasALIDATEDDATA has 69646012 observations and 86 variables.

NOTE: Compressing data set WORK.LIVEVIRUSVALIDATEDDATA decreased size by 73.23 percent.

      Compressed is 932110 pages; un-compressed would require 3482302 pages.

NOTE: DATA statement used (Total process time):

      real time           6:41.41

      cpu time            6:37.80

4508 +

4509 +%put NOTE: there were &LiveDoseViolations Records found with Live Virus Violations ;

NOTE: there were        28428 Records found with Live Virus Violations

4510 +%put Note: This should match the # of observations in Work.LiveVaxViolations ;

Note: This should match the # of observations in Work.LiveVaxViolations

4511 +

1 ACCEPTED SOLUTION

Accepted Solutions
AhmedAl_Attar
Rhodochrosite | Level 12

@spjcdc,

The fact that you have higher Join/Found count (28,428) than your Distinct count (27,542), means there are duplicate ('cdcid','VaxRecordNum') combination/key in your large table (WORK.FULLYDEDUPEDDATA) of 69,646,012 observations.

Whether that's acceptable or not, that depends on your business operations and logic.

Hope this clarify your results,

Ahmed

View solution in original post

5 REPLIES 5
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Sorry, I don't use these hash items, however if you pop in your sql code there could be optimizations to be had on that side.  E.g. create an intermediary dataset from the big one with only ids from the small one, then merge that on.

spjcdc
Calcite | Level 5

Thanks RW9.  I only know enough SQL to be dangerous.  Here's what I have.

Proc SQL ;

Create Table LiveVirusValidatedData as

Select F.*, L.DoseValidity

     from FullyDeDupedData F left join LiveVaxViolations L

     on (F.VaxRecordNum = L.VaxRecordNum)

     order by CDCID, VaxDate, VaxGroup ;    

     quit ;

Steve

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, at a quick glance (as I am about to leave) see below to create intermediate dataset, then use that to merge.  Also see if there are any other restrictions you could put in place, maybe summing data up, or transposing the data.

proc sql;

  /* Create an intermediary dataset with only the relevant data, i.e. has a match and only contains columns we need */
  create table inter as
  select  vaxrecordnum,
          dosevalidity
  from    livevaxviolations
  where   vaxrecordnum in (select distinct vaxrecordnum from fullydedupeddate);

  create table livevirusvalidateddata as
  select  f.*,
          l.dosevalidity
  from    fullydedupeddata f
  left join inter l
  on      f.vaxrecordnum = l.vaxrecordnum
  order by  cdcid,
            vaxdate,
            vaxgroup;    
quit;

AhmedAl_Attar
Rhodochrosite | Level 12

@spjcdc,

The fact that you have higher Join/Found count (28,428) than your Distinct count (27,542), means there are duplicate ('cdcid','VaxRecordNum') combination/key in your large table (WORK.FULLYDEDUPEDDATA) of 69,646,012 observations.

Whether that's acceptable or not, that depends on your business operations and logic.

Hope this clarify your results,

Ahmed

spjcdc
Calcite | Level 5

Thanks Ahmed,

There are not SUPPOSED to be duplicate values of CDCID and VaxRecordNum but apparently there are.  So that's not good news.  At least that gives me something to pursue.  Thanks. 

Steve

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 1115 views
  • 3 likes
  • 3 in conversation