BookmarkSubscribeRSS Feed
David_Billa
Rhodochrosite | Level 12

In the following code, my understanding it's doing a left join. In the result, we will get results from left table for all matching values of PLANT and PO between the two tables. Is my understanding right?

 

DATA WORK.MFG;
     MERGE WORK.MFG (in=A) WORK.PO (in=B);
     BY PLANT PO;
     IF A=1 AND B=0 THEN OUTPUT;
RUN;
14 REPLIES 14
mkeintz
PROC Star

Before discussing the IN= parameters - the MERGE statement can do a left join only if the data have these conditions

  1. Whenever WORK.MFG has one row, you can have any number of matching rows in WORK.PO (including zero such rows).
    or
  2. Whenever WORK.MFG has multiple rows, WORK.PO can have only zero or one matching row. 

If you have any instances of many-to-many then merge will not replicate a left join.

 

Of course, both datasets must be sorted by PLANT PO.

 

The 

  IF A=1 and B=0 then output;

statement tells the data step to output only those instances in which there is a WORK.MFG observation, but no matching WORK.PO observation.

 

To replicate a left join, change the IF to:

IF A=1 then output;

keeping all instances in which there is an observation in WORK.MFG, regardless of the status of matching observations in WORK.PO.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ballardw
Super User

Maybe if you run code like this:

DATA WORK.New;
     MERGE WORK.MFG (in=A) WORK.PO (in=B);
     BY PLANT PO;
     Left=A;
     Right=B;
RUN;

And examine the values of Left and Right.

NOT the same as a left join when there are multiples of the by values in both datasets (many-to-many).

Not the same as a left join on single values. Left join (unless you add a where clause and you didn't specify the Left join you think might be the same) keeps ALL records from the Left set. The IF statement excludes any that do not have a match in the right set set.

 

When you replace a data set, such as using the same data set on the Data and Merge (or Set) statement then you have destroyed the original work.mfg and the results are going to be hard to actual follow. AND if you run the same code repeatedly you will find that  MERGE can behave differently when some variables other than the BY variables exist in both data sets.

Kurt_Bremser
Super User

Do not shout at the poor SAS interpreter, it has done nothing to deserve this. And it makes code less readable.

Your code will only keep observations from MFG that do not have a match in P0.

It can be written more concisely in this way:

data mfg_merge;
merge
  mfg (in=a)
  p0 (in=b)
;
by plant p0;
if a and not b;
run;

The IN= variables are Boolean variables and can only have the values 0 (false) or 1 (true), so you can use them directly as conditions.

IF <condition> THEN OUTPUT; is equivalent to a Subsetting IF if it appears only once in the step, so it makes for better reading to use the Subsetting IF.

Overwriting a dataset in one step when the contents change is dangerous; if something fails, you need to go back further up in your code to recreate the dataset, wasting time during development. Additional WORK datasets will vanish on their own anyway when the session terminates, and if you actually run out of space, use a PROC DELETE to get rid of unneeded datasets then.

Splitting the MERGE over a separate line for each dataset makes it easier to move such lines around if you have to change the logic of several steps; keeping code elements on separate lines makes working with code easier (hint out of 20+ years of SAS coding).

David_Billa
Rhodochrosite | Level 12

@Kurt_Bremser @ballardw What will be the meaning of  IF A THEN DO; in the program below? It is similar to IF A=1?

 

DATA WORK.MFG;
     MERGE WORK.MFG (in=A) WORK.PO (in=B);
     BY PLANT PO;
     IF A THEN DO;
      VALUE=1;
RUN;
Kurt_Bremser
Super User

@David_Billa wrote:

@Kurt_Bremser @ballardw What will be the meaning of  IF A THEN DO; in the program below? It is similar to IF A=1?

 

DATA WORK.MFG;
     MERGE WORK.MFG (in=A) WORK.PO (in=B);
     BY PLANT PO;
     IF A THEN DO;
      VALUE=1;
RUN;

There's no meaning, because the code throws a syntax error and does nothing. See DO Statement for details.

 

Any numeric value that is not zero or missing is considered true, so

if a then ....

is equivalent to

if a not in (.,0) then ...
David_Billa
Rhodochrosite | Level 12
Sorry, I missed end statement in my code. Now it won't throw any syntax
error. I just wanted to know the meaning of 'if A then...'.

Also what will happen if we mention merge with by variables and without any
if a or b clause? Will it be inner join?
Kurt_Bremser
Super User

A JOIN in SQL and a MERGE in a data step can only create the same results if you do NOT have a many-to-many relationship. If you have X observations in one dataset and Y observations in the other, you'll get max(X,Y) in a MERGE,  but X * Y in a JOIN.

If you have one-to-one, one-to-many or many-to one, a MERGE with BY but no subsetting IF is similar to a FULL JOIN, but you must always beware of variables with identical names. SQL warns you, but the MERGE just accepts it and the resulting data can be quite unpredictable for someone who has yet to make the "deep dive" into the mechanics of the DATA step.

Run this and study the resulting dataset:

data a;
input id value;
datalines;
1 2
2 4
2 5
3 6
3 7
5 10
;

data b;
input id value;
datalines;
1 3
2 2
3 9
4 12
6 18
;

data ab;
merge
  a
  b
;
by id;
run;
Quentin
Super User

These are good questions, but would encourage you to make a couple tiny datasets and play with it.  You can learn a lot from playing with datasets with 3-5 records. 

 

For example, code like:

 

 

data left ;
  input id x ;
  cards ;
  1 10
  3 30
  5 50
;

data right ;
  input id y ;
  cards ;
  1 10
  2 20
  4 40
;        

data want1 ;
  merge 
   left (in=l)
   right (in=r)
  ;
  by id ;

inLeft=l;
inRight=r; run ; data want2 ; merge left (in=l) right (in=r) ; by id ; if l=1 ; run ; data want3 ; merge left (in=l) right (in=r) ; by id ; if l=1 and r=1 ; run ; data want4 ; merge left right ; *no by statement; run ;

If you know how to use the data step debugger, it can be a great help in understanding how the DATA step works. It takes a while to understand how the MERGE statement works.  You need a good understanding of the DATA step itself.  If you play around with simple code like that, and then come up with an example you don't understand, you can post a fresh questions with sample data and sample code.  Also, check lexjansen.com for SAS user group papers about MERGE.

 

 

Or, if you know SQL, and would rather think in terms of combining data via SQL joins, you could skip learning merge and decide to use PROC SQL to combine data. : )

 

BASUG is hosting free webinars ! Check out our recordings of past webinars: https://www.basug.org/videos. Be sure to subscribe to our email list for notification of future BASUG events.
David_Billa
Rhodochrosite | Level 12

@Quentin @Kurt_Bremser I don't understand the result of WANT4. What is the meaning of merge without BY statement? Why the id value 3 not appearing in the result?

Kurt_Bremser
Super User

In a MERGE without a BY, the data step reads obs 1 from the first dataset, then obs 1 from the second, then outputs. Then it reads obs 2 from the first, obs 2 from the second, and outputs. Repeat until one dataset is finished, after which the other is read until it runs out of obs, at which point the step terminates.

In the example, id value 1 from left is overwritten with 1 from right, then 3 from left is overwritten with 2 from right, and finally 5 is overwritten with 4. Since x and y are separate variables, their values are preserved.

Since both datasets have three observations, the step also terminates after three iterations.

Add an additional observation to one of the datasets to see how the step works once it runs out of observations in the other dataset.

Quentin
Super User

@David_Billa wrote:

@Quentin @Kurt_Bremser I don't understand the result of WANT4. What is the meaning of merge without BY statement? Why the id value 3 not appearing in the result?


 

A merge without a BY statement is what I call a 'smush'.  The MERGE statement will read the first record from each dataset, and smush them together, then will read the second record from each datataset, and smush them together.  There is no by variable to join on.  You could think of it like an outer join by row number. It's often a mistake to leave out the BY statement, and there is a system option which allows you to tell SAS to treat this as an error: mergenoby=ERROR.

 

Note also that if two datasets read on the MERGE statement have the same variable name, and the variable is not listed on the BY statement, they will collide.  The value read from the second dataset will overwrite the value read from the first dataset.  So in this case, the output dataset WANT4 has ID values  id=1, id=2, and id=4 because the values of ID from RIGHT have overwritten the values from LEFT.  There is a system option msglevel=i which will write a note in the log when there is a variable name collision.

 

This code:

data left ;
  input id x ;
  cards ;
  1 10
  3 30
  5 50
;

data right ;
  input id y ;
  cards ;
  1 10
  2 20
  4 40
;     

options msglevel=i ;
data want4 ;
  merge 
   left  (in=a)
   right (in=b)
  ;
  *no by statement;

  put (id x y a b)(=) ;
run ;

Results in the log:

 

17   options msglevel=i ;
18   data want4 ;
19     merge
20      left  (in=a)
21      right (in=b)
22     ;
23     *no by statement;
24
25     put (id x y a b)(=) ;
26   run ;

INFO: The variable id on data set WORK.LEFT will be overwritten by data set WORK.RIGHT.
id=1 x=10 y=10 a=1 b=1
id=2 x=30 y=20 a=1 b=1
id=4 x=50 y=40 a=1 b=1
NOTE: There were 3 observations read from the data set WORK.LEFT.
NOTE: There were 3 observations read from the data set WORK.RIGHT.
NOTE: The data set WORK.WANT4 has 3 observations and 3 variables.

HTH,

--Q.

 

 

BASUG is hosting free webinars ! Check out our recordings of past webinars: https://www.basug.org/videos. Be sure to subscribe to our email list for notification of future BASUG events.
David_Billa
Rhodochrosite | Level 12

@Kurt_Bremser  I'm still unable to understand the meaning of 'IF A THEN DO' in this code. In my understanding A is data from WORK.MFG. If yes, then how can I apply filter like 'IF A THEN DO'?

 

DATA WORK.MFG;
     MERGE WORK.MFG (in=A) WORK.PO (in=B);
     BY PLANT PO;
     IF A THEN DO;
      VALUE=1;
END; RUN;

  

Tom
Super User Tom
Super User

@David_Billa wrote:

@Kurt_Bremser  I'm still unable to understand the meaning of 'IF A THEN DO' in this code. In my understanding A is data from WORK.MFG. If yes, then how can I apply filter like 'IF A THEN DO'?

 

DATA WORK.MFG;
     MERGE WORK.MFG (in=A) WORK.PO (in=B);
     BY PLANT PO;
     IF A THEN DO;
      VALUE=1;
END; RUN;

  


The variable A is created by the data step because it is the name used in the IN= dataset option on the dataset MFG.  The value of A will be TRUE whenever the MFG is contributing to the current observation.

 

If you want to see the value it has then assign its value to another variable that is not going to be dropped from the output dataset.  Then you print the data or browse the output dataset and see how it works.

 

Note that using A nd B for those variables is probably not the best naming convention to use.  Personally I use IN as a prefix on such variables to make it clearer what they are.

 

What do you mean by FILTER?  Do you want to subset the observations that are going to be written?

 

You can DELETE observations you don't want.  You can selectively only OUTPUT the observations you want.

 

Try these steps:

data want1;
  merge mfg (in=in_mfg) po (in=in_po);
  by plant po;
  if in_mfg;
run;

data want2;
  merge mfg (in=in_mfg) po (in=in_po);
  by plant po;
  if in_mfg then output;
run;

data want3;
  merge mfg (in=in_mfg) po (in=in_po);
  by plant po;
  if not in_mfg then delete;
run;
Kurt_Bremser
Super User

And just so we get your wording straight: IN in your question is not an operator, it is a dataset option.

if x in (1,2,3)
/*   ^ operator */
merge
  a (in=a)
/*   ^ dataset option */

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 2736 views
  • 7 likes
  • 6 in conversation