data step equivalent of SQL left join

Accepted Solution Solved
Reply
Regular Contributor
Posts: 161
Accepted Solution

data step equivalent of SQL left join

Dear All:

For some reason SQL often report "run out of disk space" when I merge two tables.  So I decide to do the same thing in DATA STEP.  Are the following Data Step the same as the following SQL procedures?

proc sql;

  create table WANT as

  select distinct a.*, b.VAR1 , b.VAR2

  from HAVE1 as a left join HAVE2  (DROP=link_count) as b

  on a.PATIENT_ID = b.PATIENT_ID and a.YEAR = b.YEAR;

quit;

proc sort data=HAVE1;

     by PATIENT_ID YEAR;

run;

proc sort data=HAVE2(drop=link_count);

     by PATIENT_ID YEAR;

run;

data WANT;

     merge HAVE1 (in=a) HAVE2(in=b);

     by PATIENT_ID YEAR;

     if a;

run;


Accepted Solutions
Solution
‎11-17-2013 02:04 AM
Respected Advisor
Posts: 3,894

Re: data step equivalent of SQL left join

You're actually not simply updating an existing data set but you're also adding new columns. This requires re-creation of the data set. A modify statement (changing stuff in place) is only possible for existing columns (variables). Sorry - missed this bit when I've suggested to use "modify".

Nevertheless: The biggest gain will be in using a hash table and so avoidance of sorting -  and as there are no duplicate keys in the source for the hash table it's easy to code (no looping over the hash table required).

Below a code example.

When asking for actual code then may I ask you to provide in the future sample data as I've done it now in the first 2 data steps.

data Big;
  do patient_id=1,2,3;
    format year year4.;
    do year='01jan2010'd,'01jan2011'd,'01jan2012'd,'01jan2013'd;
      do until(Big_Var1<500);
        Big_Var1=ceil(ranuni(1)*1000);
        output;
      end;
    end;
  end;
  stop;
run;

data Small;
  do patient_id=1,3,4;
    format year year4.;
    do year='01jan2010'd,'01jan2012'd,'01jan2013'd;
      Small_Var1=ceil(ranuni(1)*1000);
      Small_Var2=ceil(ranuni(1)*1000);
      output;
    end;
  end;
  stop;
run;

data Big(compress=yes drop=_Smiley Happy;
  set big;

  if _n_=1 then
    do;
      /* map additional variables */
      if 0 then set Small(keep=Small_Var1 Small_Var2);
      /* define hash */
      declare hash h (dataset:'Small(keep=patient_id year Small_Var1 Small_Var2)');
      _rc=h.defineKey('patient_id','year');
      _rc=h.defineData('Small_Var1','Small_Var2');
      _rc=h.defineDone();
    end;
 
  /* h.find() looks up the values in the hash table. If a match is found then the find() method returns a '0'  */
  /* if no match then set variables to missing       */
  if h.find() ne 0 then call missing(Small_Var1, Small_Var2);
run;

View solution in original post


All Replies
Respected Advisor
Posts: 3,894

Re: data step equivalent of SQL left join

In regards of number of rows the result will be the same as long as the relationship between the tables is not many to many.

You need to control from which table same named variables are taken. If you have a variable which is not part of the key but exists in both data sets then the value of the HAVE2 will be used in case of a match but the value of HAVE1 will be used if there is no match. That's very likely not the result you're after.

Also in your SQL you have "b.var1,b.var2". You need to do something similar in the merge or all variables from HAVE2 will become part of WANT.

merge HAVE1 (in=a) HAVE2(in=b keep=patient_id year var1 var2);

If the sole reason for not using a SQL join is disk space then you should probably look to get more space for "WORK" and "UTILLOC" - or point these to locations to a fast disk(s) with sufficient space available.

Regular Contributor
Posts: 161

Re: data step equivalent of SQL left join

Thank you so much, Patrick!


Basically what I'm trying to do here is update HAVE1 using information from HAVE2.  If HAVE2 does not have relevant information based on the two matching criteria (PATIENT_ID AND YEAR), then I want to set VAR1 and VAR2 on WANT to missing, just as the SQL left join does.  I modified my code in accordance with your advice.  How do you think this time?  Will the following DATA STEP does the exactly what the SQL left join I mentioned above does?  By using the KEEP option, the only variables that exist in both HAVE1 and HAVE2 are PATIENT_ID and YEAR.

proc sort data=HAVE1;

     by PATIENT_ID YEAR;

run;

proc sort data=HAVE2(keep=PATIENT_ID YEAR VAR1 VAR2);

     by PATIENT_ID YEAR;

run;

data WANT;

     merge HAVE1 (in=a) HAVE2(in=b);

     by PATIENT_ID YEAR;

     if a;

run;

About the disk size issue.  HAVE1 is 400 GB in size.  HAVE2 is 800 MB in size.  When merging these two dataset, the operation fills the entire server with free disk space of 1.4 TB.  I don't know what went wrong with the SQL code.  That is why I'm suspecting that SQL is doing something inefficient because the SQL code runs perfectly fine for smaller sample.  As a side note, do you think the data step is going to solve the problem?  Many thanks for your help -

Respected Advisor
Posts: 3,894

Re: data step equivalent of SQL left join

If you don't have a many:many relationship between have1 and have2 then the merge syntax you've posted should give you the same result than the SQL left join.

From a performance perspective: What's time consuming is the sort of HAVE1 (also the SQL needs to sort and keeps this sorted data sets in utilloc during execution. That's why it takes up so much space).

The most efficient way would be to have a data step where you load the small table Have2 into a hash table with key "patient_id, year" and data "var1, var2". This way you don't need to sort the two source data sets.

If you want to speed up things even more then you don't create a copy of Have1 but add the 2 new variables "var1, var2" directly to Have1 using a modify statement (in combination with the hash lookup).

Also make sure that have1 is compressed (options compress=yes) - or if creating a new data set "want" then at least make sure that this data set gets created compressed.

Is your key "patient_id, year" the composite primary key (so unique in both source data sets)?

Regular Contributor
Posts: 161

Re: data step equivalent of SQL left join

Is your key "patient_id, year" the composite primary key (so unique in both source data sets)?

No, the combination of patient_id and year uniquely identify each observation in HAVE2, but not HAVE1.  So HAVE2 is the look-up table I hope to pull the attributes (var1, var2) from. 

May I ask how to implement the awesome method you suggested above using modify and hash table using the SQL example I mentioned above.  I'm extremely interested.  I looked up online on this: http://www2.sas.com/proceedings/forum2008/029-2008.pdf and Combining and Modifying SAS DATA SET by example.  But the discussion is highly technical, seemly geared toward data scientists than to users like me.. Smiley Happy

Many thanks !!

Solution
‎11-17-2013 02:04 AM
Respected Advisor
Posts: 3,894

Re: data step equivalent of SQL left join

You're actually not simply updating an existing data set but you're also adding new columns. This requires re-creation of the data set. A modify statement (changing stuff in place) is only possible for existing columns (variables). Sorry - missed this bit when I've suggested to use "modify".

Nevertheless: The biggest gain will be in using a hash table and so avoidance of sorting -  and as there are no duplicate keys in the source for the hash table it's easy to code (no looping over the hash table required).

Below a code example.

When asking for actual code then may I ask you to provide in the future sample data as I've done it now in the first 2 data steps.

data Big;
  do patient_id=1,2,3;
    format year year4.;
    do year='01jan2010'd,'01jan2011'd,'01jan2012'd,'01jan2013'd;
      do until(Big_Var1<500);
        Big_Var1=ceil(ranuni(1)*1000);
        output;
      end;
    end;
  end;
  stop;
run;

data Small;
  do patient_id=1,3,4;
    format year year4.;
    do year='01jan2010'd,'01jan2012'd,'01jan2013'd;
      Small_Var1=ceil(ranuni(1)*1000);
      Small_Var2=ceil(ranuni(1)*1000);
      output;
    end;
  end;
  stop;
run;

data Big(compress=yes drop=_Smiley Happy;
  set big;

  if _n_=1 then
    do;
      /* map additional variables */
      if 0 then set Small(keep=Small_Var1 Small_Var2);
      /* define hash */
      declare hash h (dataset:'Small(keep=patient_id year Small_Var1 Small_Var2)');
      _rc=h.defineKey('patient_id','year');
      _rc=h.defineData('Small_Var1','Small_Var2');
      _rc=h.defineDone();
    end;
 
  /* h.find() looks up the values in the hash table. If a match is found then the find() method returns a '0'  */
  /* if no match then set variables to missing       */
  if h.find() ne 0 then call missing(Small_Var1, Small_Var2);
run;

Super Contributor
Super Contributor
Posts: 440

Re: data step equivalent of SQL left join

1.Can he consider indexing here?

2.I read in an article compressing is only good if the set char variables or char variables outnumber the num variables?

Respected Advisor
Posts: 3,894

Re: data step equivalent of SQL left join

Index: A hash table gets fully loaded into memory, an index doesn't. In the case here where it's not only about a look-up of a few rows the hash will outperform any index.

Compression: Can you post the link to this article? What if you have 20 numeric vars and 1 character var - but the character var has a length of thousand and is most of the time missing?

Super Contributor
Super Contributor
Posts: 440

Re: data step equivalent of SQL left join

this is the file i read a couple of weeks ago

http://www.ats.ucla.edu/stat/sas/library/nesug00/ad1007.pdf

then i tried to compress a dataset with mostly num variables and the set did increase after the compression but never tried it on sets with long char variables though

Respected Advisor
Posts: 3,894

Re: data step equivalent of SQL left join

There is also binary compression available for datasets with mainly numeric variables: "compress=binary"

SAS(R) 9.2 Language Reference: Dictionary, Fourth Edition

Regular Contributor
Posts: 161

Re: data step equivalent of SQL left join

Hi, Patrick:

I adopted your code as follows and it works like magic.  I'm so happy.  Hopefully, it will implement better when applied to large dataset.  One question: could you take a look at my newly attached sample data to see why using SQL is so inefficient?  I feel it may have something to do with the format, causing the SQL to form Cartesian product.  Thus, the intermediate table becomes exceedingly big.  When I run on server, it ate ups to 3 TB disk space, even though the HAVE1 only has 400 GB in size.  Am I right on this?  Thank you so much Smiley Happy

proc sql;

  create table ftemp.combo as

  select a.*, b.diag_date, b.diag_type

  from big as a left join small as b

  on a.patient_id = b.patient_id and a.document_id = b.document_id;

quit;

proc sort data=ftemp.combo; by record_retrival_id; run;

data ftemp.combo2 (compress=yes drop=_Smiley Happy;

  set big;

  if _n_=1 then

    do;

      /* map additional variables */

      if 0 then set Small (keep=diag_date diag_type);

      /* define hash */

      declare hash h (dataset:'Small(keep=patient_id document_id diag_date diag_type)');

      _rc=h.defineKey('patient_id','document_id');

      _rc=h.defineData('diag_date','diag_type');

      _rc=h.defineDone();

    end;

 

  /* h.find() looks up the values in the hash table. If a match is found then the find() method returns a '0'  */

  /* if no match then set variables to missing       */

  if h.find() ne 0 then call missing(diag_date, diag_type);

run;

proc sort data=ftemp.combo2; by record_retrival_id; run;

proc compare base=ftemp.combo compare=ftemp.combo2 ;

run;

Respected Advisor
Posts: 3,894

Re: data step equivalent of SQL left join

It's normally best to have one question asked and resolved per thread.

Suggest you start a new post where you specifically ask for an explanation why using your SQL code takes up so much disk space and post your sample data there. Cross reference this post in your new post and clearly state that you're not after another coding solution but after a better understanding of what the SQL does.

Regular Contributor
Posts: 161

Re: data step equivalent of SQL left join

Thanks Smiley Happy

Is it possible to use hash table when I hope to join a table based on whether a date from table A falls between date1 and date2 in table B?  Thanks -

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 2520 views
  • 0 likes
  • 3 in conversation