Re: syntax for a left join in SAS, not in the context of proc sql?

GN0001 · Posted 07-22-2021 06:01 PM

Hello team,

Would would be the context left and right join in SAS without using in proc sql?

Regards,

Blue blue

Blue Blue

Rydhm · Posted 07-22-2021 06:04 PM

It is merge.
you can read about it here: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lestmtsref/n1i8w2bwu1fn5kn1gpxj18xttbb0.htm

GN0001 · Posted 07-24-2021 02:09 AM

Thanks for sending the link.

It is very hard to understand from this document. I have read it several times!

Regards,

blueblue

Blue Blue

Kurt_Bremser · Posted 07-24-2021 02:17 AM

For an easy entry into the documentation, start with the examples first. Copy the code and run it, then play around with it.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ballardw · Posted 07-22-2021 06:20 PM

You add dataset option IN= to each data set to identify if the current observation comes from the desired set. The select for output those that you want.

Dummy code:

Data want;
   merge data1 (in=inone)
              data2 (in=intwo)
   ;
   by keyvariable;
  /*in a given data step likely only want one of the following
    for most cases.
  */
  /* select records where data1 contributed*/
  if inone;
  /* or select records where data2 contributed*/
  if intwo;
  /* or select records where both data1 and data2*/
  if inone and intwo;
  /* select records from data1 that had no match in data2*/
  if inone and not intwo;
run;

The IN variables are basically boolean: 1/0 for Yes/no so can be used in if statements. (Or summed to identify how many data sets contributed to given record since a MERGE statement can have many data sets ).

However the Merge is still not the same as a "join on" when by values are duplicated in both sets.

Kurt_Bremser · Posted 07-23-2021 02:07 AM

Basically there isn't, because SQL works different when a many-to-many relationship is encountered.

A DATA step can only replicate a left or right join when there is either a one-to-one, one-to-many or many-to-one relationship.

Once that is verified, read the documentation of the MERGE Statement

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 07-24-2021 01:08 PM

As @Kurt_Bremser pointed out, if you have a 1:many, many:1, or 1:1 for every key value, then a MERGE statement can easily work like a left join.

But many-to-many takes more effort, and use of a hash object.


data want;

  set right (in=inR)  left (in=inL);
  by key;

  if _n_=1 then do;
    declare hash r (dataset:'right (obs=0)',multidata:'Y');
      r.definekey('key');
      r.definedata(all:'Y');
      r.definedone();
  end;
  if first.key then r.clear();

  if inR then r.add();
  if inL ;


  if r.num_items=0 then output;
  else do rc=r.find() by 0 until (r.find_next()^=0);
    output;
  end;
  call missing(of _all_);
run;

Basically, for each KEY, all the right-side observations are read first and loaded into a hash object in memory. Then for each subsequent left observation, the program iterates through the hash object, retrieving values from the right-side and outputting for each iteration.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

GN0001 · Posted 07-24-2021 04:42 PM

Hello,

Basically, for each KEY, all the right-side observations are read first and loaded into a hash object in memory. Then for each subsequent left observation, the program iterates through the hash object, retrieving values from the right-side and outputting for each iteration.

Does this

set right (in=inR)  left (in=inL);
  by key;

determine which table should be read first?

If we had:

set Left (in=inL) right(in=inR)

Is the top reads the right table first even if it is the second table in a row?

Respectfully,

blueblue

Blue Blue

mkeintz · Posted 07-25-2021 02:19 PM

@GN0001 wrote:

Hello,

Basically, for each KEY, all the right-side observations are read first and loaded into a hash object in memory. Then for each subsequent left observation, the program iterates through the hash object, retrieving values from the right-side and outputting for each iteration.

Does this
set right (in=inR)  left (in=inL);
  by key;
determine which table should be read first?

If we had:

set Left (in=inL) right(in=inR)

Is the top reads the right table first even if it is the second table in a row?

Respectfully,

blueblue

This is one of those questions that can well be answered by

Examining the Documentation on the SET statement
Take notice of examples 1: concatenating data sets, and 2: interleaving data sets.

and/or
Running some tests - one of the great features of learning any programming language or programming tool like SAS.

More generally, since you want to do a left join, you first want to load the RIGHT data into memory, so that it can be repeatedly and inexpensively looped over for each su7bsequent matching LEFT observation. This in turn means you have to read the RIGHT prior to the LEFT, in order to put it in memory. Thus "set right left". The opposite ("set left right") would yield NO data from RIGHT, since it isn't in memory when processing the LEFT dataset.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Classroom Training Available!