merge

HeatherNewton · Posted 03-24-2022 09:56 PM

proc sort data=samp_acct;
by org account;
run;

data samp_acct;
merge samp_acct (IN=A) GUARANTEE(in=b);
by org account;
if A;
if b then IND_GUARANTEE=1';
run;

I dont get this. What does it mean to merge with if A and if B? Can you put this into sql join statement for me? thanks

mkeintz · Posted 03-24-2022 11:14 PM

The "if a" statement has no "then" clause. This makes it a subsetting if. Only merged observations in which the a dummy var =1 will be processed any further. The a dummy variable is a data set name parameter per 'samp_acct (in=a)', meaning A=1 for merged observations with a component from samp_acct, and A=0 for merged observations with no component from A. So the "if a" statement means only observations with a component from samp_acct will be output.

The "if b then ..." occurs after the subsetying if. This means that only observations that also have a component from guarantee will have the action specified in the then clause.

BTW, the "if B then ..." has a syntax error (unbalanced quotes).

Note this means that a merged observation could have components from A only, from B only (excluded here by the subsetttng if), or from both A and B.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SASKiwi · Posted 03-24-2022 11:23 PM

The nearest equivalent in SQL is a LEFT JOIN:

proc sql;
  create samp_acct as
  select *
        ,case 
           when not missing(B.account) then 1
           else 0
         end as ind_guarantee
  from samp_acct as a
  left join guarantee as b
  on a.org = b.org
  and a.account = b.account
  ;
quit;

Kurt_Bremser · Posted 03-25-2022 03:59 AM

data samp_acct;
merge
  samp_acct (in=a)
  guarantee (in=b)
;
by org account;
if a; /* subsetting if; only matches in samp_acct will make it through. */
/* org_account values only present in guarantee will be filtered out */
if b then ind_guarantee = 1; /* removed surplus quote here */
/* org_account values present in both datasets will be flagged with a 1 */
run;

Currently, ind_guarantee will be either missing or 1, which can be used as "false" and "true", respectively.

If you don't mind having a 0 instead of missing, you can do this:

data samp_acct;
merge
  samp_acct (in=a)
  guarantee (in=b)
;
by org account;
if a;
ind_guarantee = b;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

HeatherNewton · Posted 03-25-2022 05:19 AM

am I right to say that we only care if a is not null and therefore it is a left join?

I am not sure about the significance of if b statement here

Kurt_Bremser · Posted 03-25-2022 05:50 AM

IF (and that's a big if) you do not have a many-to-many relationship, you can get the same matches with SQL.

The big differences between the data step and SQL are the handling of a many-to-many relationship (SQL creates a cartesian join, the data step just the max number of observations for each BY group), and how variables with equal names end up in the resulting dataset.

I suggest you create some dummy datasets, experiment with code and study the results (Maxim 4). Learning by doing always wins.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 03-25-2022 05:55 AM

The IF B condition checks for the presence of a match in the second dataset (for which the dummy variable b was defined) and then conditionally sets guarantee. That's it, and that's why it is mainly equivalent to using b directly in an assignment, as I suggested. It depends on how you use guarantee later. If you test for non-missing, leave the code as is, but if you only use it for summarization (SUM function) or as boolean value, the assignment method is valid.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ballardw · Posted 03-25-2022 10:26 AM

@HeatherNewton wrote:

am I right to say that we only care if a is not null and therefore it is a left join?

I am not sure about the significance of if b statement here

I would not assume that. LEFT, or any other join, requires you to indicate which source of common named variables will be in the result table. If you are lazy and using something like A.*, B.* with common named variables in the source sets, besides generating messages about "variable XXX is already on the data set" the first named set/version of the variable in select is kept.

In a data step merge you get no warnings and the last dataset contributes the values for common variables. Additionally the MERGE will report variables of the same name as having different lengths and possible truncation issues. Plus without additional coding you cannot reference the liked name variables from each contributing set.

Plus if you use a BY to indicate common variables to align the sets the data step merge will require that both sets be in the same order or the BY will fail.

Also a Left (or Right) join will have all the records from the first or second set. A data step merge will have all the records combined. If there are not common BY values when merged then both sets contribute to the result, more like outer joins.

Plus Many -to-many on the By variables in a merge does not resemble the result of any of the join in most cases.

merge

Re: merge

Re: merge

Re: merge

Re: merge

Re: merge

Re: merge

Re: merge

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!