I have a very large dataset (69 million records) that I need to merge with a relatively small dataset (28k records). I found PROC SQL took too long, so I tried to do the merge with a SAS hash object, using the code below. I wanted to verify that everything was merging OK, so I created a counter that increments every time there is a match. I expected the counter to match the number of observations in the smaller dataset. The dataset LiveDoseViolations has 27,542 observations in it, but when I run the code the counter has the value 28,428. I don't know how that could be the case, nor do I really know how to figure out why. Any ideas?
data LiveVirusValidatedData (drop=LiveDoseViolations) ;
   if _N_ = 1 then do;
      declare hash h(dataset:'LiveVaxViolations') ;
      h.defineKey('cdcid','VaxRecordNum') ;
      h.defineData('DoseValidity');
      h.defineDone();
      length CDCID $30 VaxRecordNum 8 DoseValidity $15 ;
      call missing(DoseValidity) ;
   end;
   set FullyDeDupedData end=lastob ;
   rc = h.find() ;
   if rc = 0 then LiveDoseViolations + 1 ;
   if lastob then call symput('LiveDoseViolations',LiveDoseViolations) ;
run;
The log is below:
4491 +
4492 +data LiveVirusValidatedData (drop=LiveDoseViolations) ;
4493 +
4494 +if _N_ = 1
4495 +then do;
4496 + declare hash h(dataset:'LiveVaxViolations') ;
4497 + h.defineKey('cdcid','VaxRecordNum') ;
4498 + h.defineData('DoseValidity');
4499 + h.defineDone();
4500 +length CDCID $30 VaxRecordNum 8 DoseValidity $15 ;
4501 + call missing(DoseValidity) ;
4502 + end;
4503 +set FullyDeDupedData end=lastob ;
4504 +rc = h.find() ;
4505 +if rc = 0 then LiveDoseViolations + 1 ;
4506 +if lastob then call symput('LiveDoseViolations',LiveDoseViolations) ;
4507 +run;
NOTE: Numeric values have been converted to character values at the places given by:
(Line):(Column).
4506:49
NOTE: There were 27542 observations read from the data set WORK.LIVEVAXVIOLATIONS.
NOTE: There were 69646012 observations read from the data set WORK.FULLYDEDUPEDDATA.
NOTE: The data set WORK.LIVEVIRUSVALIDATEDDATA has 69646012 observations and 86 variables.
NOTE: Compressing data set WORK.LIVEVIRUSVALIDATEDDATA decreased size by 73.23 percent.
Compressed is 932110 pages; un-compressed would require 3482302 pages.
NOTE: DATA statement used (Total process time):
real time 6:41.41
cpu time 6:37.80
4508 +
4509 +%put NOTE: there were &LiveDoseViolations Records found with Live Virus Violations ;
NOTE: there were 28428 Records found with Live Virus Violations
4510 +%put Note: This should match the # of observations in Work.LiveVaxViolations ;
Note: This should match the # of observations in Work.LiveVaxViolations
4511 +
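As an aside, the conversion note at 4506:49 in the log comes from passing the numeric counter to CALL SYMPUT, which expects a character second argument. A minimal alternative, assuming the same variable names as above, is CALL SYMPUTX, which accepts a numeric value and strips leading/trailing blanks from the macro variable:

/* CALL SYMPUTX handles the numeric-to-character conversion itself
   and trims blanks, avoiding the conversion note in the log */
if lastob then call symputx('LiveDoseViolations', LiveDoseViolations) ;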
@spjcdc,
The fact that your Join/Found count (28,428) is higher than your distinct count (27,542) means there are duplicate ('cdcid','VaxRecordNum') key combinations in your large table (WORK.FULLYDEDUPEDDATA) of 69,646,012 observations.
Whether that's acceptable or not, that depends on your business operations and logic.
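One way to see the duplicates directly is to count how often each key pair occurs in the large table. A quick sketch, assuming the variable and dataset names from the code above:

/* List (cdcid, VaxRecordNum) pairs that occur more than once
   in the large table, with their counts */
proc sql;
   create table dupkeys as
   select cdcid, VaxRecordNum, count(*) as n
   from FullyDeDupedData
   group by cdcid, VaxRecordNum
   having count(*) > 1;
quit;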
Hope this clarifies your results,
Ahmed
Sorry, I don't use these hash items, however if you pop in your sql code there could be optimizations to be had on that side. E.g. create an intermediary dataset from the big one with only ids from the small one, then merge that on.
Thanks RW9. I only know enough SQL to be dangerous. Here's what I have.
proc sql ;
   create table LiveVirusValidatedData as
   select F.*, L.DoseValidity
   from FullyDeDupedData F left join LiveVaxViolations L
      on (F.VaxRecordNum = L.VaxRecordNum)
   order by CDCID, VaxDate, VaxGroup ;
quit ;
Steve
Well, at a quick glance (as I am about to leave) see below to create intermediate dataset, then use that to merge. Also see if there are any other restrictions you could put in place, maybe summing data up, or transposing the data.
proc sql;
/* Create an intermediary dataset with only the relevant data, i.e. has a match and only contains columns we need */
create table inter as
select vaxrecordnum,
dosevalidity
from livevaxviolations
where vaxrecordnum in (select distinct vaxrecordnum from fullydedupeddata);
create table livevirusvalidateddata as
select f.*,
l.dosevalidity
from fullydedupeddata f
left join inter l
on f.vaxrecordnum = l.vaxrecordnum
order by cdcid,
vaxdate,
vaxgroup;
quit;
Thanks Ahmed,
There are not SUPPOSED to be duplicate values of CDCID and VaxRecordNum but apparently there are. So that's not good news. At least that gives me something to pursue. Thanks.
Steve
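To track down the unexpected duplicate keys, one option is PROC SORT with NODUPKEY and DUPOUT, which writes the extra records per key to a separate dataset for inspection. A sketch, assuming the key variables named in the original code:

/* Keep the first record per (cdcid, VaxRecordNum) key in DeDupedCheck;
   DUPOUT= captures the duplicate records for review */
proc sort data=FullyDeDupedData out=DeDupedCheck nodupkey dupout=DupKeys;
   by cdcid VaxRecordNum;
run;

Comparing the observation count of DupKeys against the 886-record discrepancy (28,428 minus 27,542) should show whether duplicates in the large table fully explain the counter.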