BookmarkSubscribeRSS Feed
HeatherNewton
Quartz | Level 8
proc sort data=samp_acct;
by org account;
run;

data samp_acct;
merge samp_acct (IN=A) GUARANTEE(in=b);
by org account;
if A;
if b then IND_GUARANTEE=1';
run;

I dont get this. What does it mean to merge with if A and if B? Can you put this into sql  join statement for me? thanks

7 REPLIES 7
mkeintz
PROC Star

The "if a" statement has no "then" clause.  This makes it a subsetting if.  Only merged observations in which the a dummy var =1 will be processed any further.  The a dummy variable is a data set name parameter per  'samp_acct (in=a)',  meaning A=1 for merged observations with a component from samp_acct, and A=0 for merged observations with no component from A.  So the "if a" statement means only observations with a component from samp_acct will be output.

 

The "if b then ..." occurs after the subsetying if.  This means that only observations that also have a component from guarantee will have the action specified in the then clause.    

 

BTW, the "if B then ..." has a syntax error (unbalanced quotes).

 

Note this means that a merged observation could have components from A only, from B only (excluded here by the subsetttng if), or from both A and B.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
SASKiwi
PROC Star

The nearest equivalent in SQL is a LEFT JOIN:

proc sql;
  create samp_acct as
  select *
        ,case 
           when not missing(B.account) then 1
           else 0
         end as ind_guarantee
  from samp_acct as a
  left join guarantee as b
  on a.org = b.org
  and a.account = b.account
  ;
quit;
Kurt_Bremser
Super User
data samp_acct;
merge
  samp_acct (in=a)
  guarantee (in=b)
;
by org account;
if a; /* subsetting if; only matches in samp_acct will make it through. */
/* org_account values only present in guarantee will be filtered out */
if b then ind_guarantee = 1; /* removed surplus quote here */
/* org_account values present in both datasets will be flagged with a 1 */
run;

Currently, ind_guarantee will be either missing or 1, which can be used as "false" and "true", respectively.

If you don't mind having a 0 instead of missing, you can do this:

data samp_acct;
merge
  samp_acct (in=a)
  guarantee (in=b)
;
by org account;
if a;
ind_guarantee = b;
run;
HeatherNewton
Quartz | Level 8

am I right to say that we only care if a is not null  and therefore it is a left join?

I am not sure about the significance of if b statement here

Kurt_Bremser
Super User

IF (and that's a big if) you do not have a many-to-many relationship, you can get the same matches with SQL.

The big differences between the data step and SQL are the handling of a many-to-many relationship (SQL creates a cartesian join, the data step just the max number of observations for each BY group), and how variables with equal names end up in the resulting dataset.

I suggest you create some dummy datasets, experiment with code and study the results (Maxim 4). Learning by doing always wins.

 

Kurt_Bremser
Super User

The IF B condition checks for the presence of a match in the second dataset (for which the dummy variable b was defined) and then conditionally sets guarantee. That's it, and that's why it is mainly equivalent to using b directly in an assignment, as I suggested. It depends on how you use guarantee later. If you test for non-missing, leave the code as is, but if you only use it for summarization (SUM function) or as boolean value, the assignment method is valid.

ballardw
Super User

@HeatherNewton wrote:

am I right to say that we only care if a is not null  and therefore it is a left join?

I am not sure about the significance of if b statement here


I would not assume that. LEFT, or any other join, requires you to indicate which source of common named variables will be in the result table. If you are lazy and using something like A.*, B.* with common named variables in the source sets, besides generating messages about "variable XXX is already on the data set" the first named set/version of the variable in select is kept.

 

In a data step merge you get no warnings and the last dataset contributes the values for common variables. Additionally the MERGE will report variables of the same name as having different lengths and possible truncation issues. Plus without additional coding you cannot reference the liked name variables from each contributing set.

Plus if you use a BY to indicate common variables to align the sets the data step merge will require that both sets be in the same order or the BY will fail.

Also a Left (or Right) join will have all the records from the first or second set. A data step merge will have all the records combined. If there are not common BY values when merged then both sets contribute to the result, more like outer joins.

 

Plus Many -to-many on the By variables in a merge does not resemble the result of any of the join in most cases.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 586 views
  • 0 likes
  • 5 in conversation