Re: Understanding of Merge using IN operator

David_Billa · Posted 04-11-2022 09:24 PM

In the following code, my understanding it's doing a left join. In the result, we will get results from left table for all matching values of PLANT and PO between the two tables. Is my understanding right?

DATA WORK.MFG;
     MERGE WORK.MFG (in=A) WORK.PO (in=B);
     BY PLANT PO;
     IF A=1 AND B=0 THEN OUTPUT;
RUN;

mkeintz · Posted 04-11-2022 10:59 PM

Before discussing the IN= parameters - the MERGE statement can do a left join only if the data have these conditions

Whenever WORK.MFG has one row, you can have any number of matching rows in WORK.PO (including zero such rows).
or
Whenever WORK.MFG has multiple rows, WORK.PO can have only zero or one matching row.

If you have any instances of many-to-many then merge will not replicate a left join.

Of course, both datasets must be sorted by PLANT PO.

The

  IF A=1 and B=0 then output;

statement tells the data step to output only those instances in which there is a WORK.MFG observation, but no matching WORK.PO observation.

To replicate a left join, change the IF to:

IF A=1 then output;

keeping all instances in which there is an observation in WORK.MFG, regardless of the status of matching observations in WORK.PO.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ballardw · Posted 04-12-2022 01:39 AM

Maybe if you run code like this:

DATA WORK.New;
     MERGE WORK.MFG (in=A) WORK.PO (in=B);
     BY PLANT PO;
     Left=A;
     Right=B;
RUN;

And examine the values of Left and Right.

NOT the same as a left join when there are multiples of the by values in both datasets (many-to-many).

Not the same as a left join on single values. Left join (unless you add a where clause and you didn't specify the Left join you think might be the same) keeps ALL records from the Left set. The IF statement excludes any that do not have a match in the right set set.

When you replace a data set, such as using the same data set on the Data and Merge (or Set) statement then you have destroyed the original work.mfg and the results are going to be hard to actual follow. AND if you run the same code repeatedly you will find that MERGE can behave differently when some variables other than the BY variables exist in both data sets.

Kurt_Bremser · Posted 04-12-2022 02:44 AM

Do not shout at the poor SAS interpreter, it has done nothing to deserve this. And it makes code less readable.

Your code will only keep observations from MFG that do not have a match in P0.

It can be written more concisely in this way:

data mfg_merge;
merge
  mfg (in=a)
  p0 (in=b)
;
by plant p0;
if a and not b;
run;

The IN= variables are Boolean variables and can only have the values 0 (false) or 1 (true), so you can use them directly as conditions.

IF <condition> THEN OUTPUT; is equivalent to a Subsetting IF if it appears only once in the step, so it makes for better reading to use the Subsetting IF.

Overwriting a dataset in one step when the contents change is dangerous; if something fails, you need to go back further up in your code to recreate the dataset, wasting time during development. Additional WORK datasets will vanish on their own anyway when the session terminates, and if you actually run out of space, use a PROC DELETE to get rid of unneeded datasets then.

Splitting the MERGE over a separate line for each dataset makes it easier to move such lines around if you have to change the logic of several steps; keeping code elements on separate lines makes working with code easier (hint out of 20+ years of SAS coding).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

David_Billa · Posted 04-12-2022 05:47 AM

@Kurt_Bremser @ballardw What will be the meaning of IF A THEN DO; in the program below? It is similar to IF A=1?

DATA WORK.MFG;
     MERGE WORK.MFG (in=A) WORK.PO (in=B);
     BY PLANT PO;
     IF A THEN DO;
      VALUE=1;
RUN;

Kurt_Bremser · Posted 04-12-2022 06:31 AM

@David_Billa wrote:

@Kurt_Bremser @ballardw What will be the meaning of IF A THEN DO; in the program below? It is similar to IF A=1?
DATA WORK.MFG;
     MERGE WORK.MFG (in=A) WORK.PO (in=B);
     BY PLANT PO;
     IF A THEN DO;
      VALUE=1;
RUN;

There's no meaning, because the code throws a syntax error and does nothing. See DO Statement for details.

Any numeric value that is not zero or missing is considered true, so

if a then ....

is equivalent to

if a not in (.,0) then ...

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

David_Billa · Posted 04-12-2022 07:44 AM

Sorry, I missed end statement in my code. Now it won't throw any syntax
error. I just wanted to know the meaning of 'if A then...'.

Also what will happen if we mention merge with by variables and without any
if a or b clause? Will it be inner join?

Kurt_Bremser · Posted 04-12-2022 08:09 AM

A JOIN in SQL and a MERGE in a data step can only create the same results if you do NOT have a many-to-many relationship. If you have X observations in one dataset and Y observations in the other, you'll get max(X,Y) in a MERGE, but X * Y in a JOIN.

If you have one-to-one, one-to-many or many-to one, a MERGE with BY but no subsetting IF is similar to a FULL JOIN, but you must always beware of variables with identical names. SQL warns you, but the MERGE just accepts it and the resulting data can be quite unpredictable for someone who has yet to make the "deep dive" into the mechanics of the DATA step.

Run this and study the resulting dataset:

data a;
input id value;
datalines;
1 2
2 4
2 5
3 6
3 7
5 10
;

data b;
input id value;
datalines;
1 3
2 2
3 9
4 12
6 18
;

data ab;
merge
  a
  b
;
by id;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Quentin · Posted 04-12-2022 08:14 AM

These are good questions, but would encourage you to make a couple tiny datasets and play with it. You can learn a lot from playing with datasets with 3-5 records.

For example, code like:

data left ;
  input id x ;
  cards ;
  1 10
  3 30
  5 50
;

data right ;
  input id y ;
  cards ;
  1 10
  2 20
  4 40
;        

data want1 ;
  merge 
   left (in=l)
   right (in=r)
  ;
  by id ;
 
  inLeft=l;
  inRight=r;
run ;


data want2 ;
  merge 
   left  (in=l)
   right (in=r)
  ;
  by id ;
  if l=1 ;
run ;

data want3 ;
  merge 
   left  (in=l)
   right (in=r)
  ;
  by id ;
  if l=1 and r=1 ;
run ;

data want4 ;
  merge 
   left  
   right 
  ;
  *no by statement;
run ;

If you know how to use the data step debugger, it can be a great help in understanding how the DATA step works. It takes a while to understand how the MERGE statement works. You need a good understanding of the DATA step itself. If you play around with simple code like that, and then come up with an example you don't understand, you can post a fresh questions with sample data and sample code. Also, check lexjansen.com for SAS user group papers about MERGE.

Or, if you know SQL, and would rather think in terms of combining data via SQL joins, you could skip learning merge and decide to use PROC SQL to combine data. : )

The Boston Area SAS Users Group is hosting free webinars!

Register now at https://www.basug.org/events.

David_Billa · Posted 04-18-2022 03:50 AM

@Quentin @Kurt_Bremser I don't understand the result of WANT4. What is the meaning of merge without BY statement? Why the id value 3 not appearing in the result?

Kurt_Bremser · Posted 04-18-2022 08:59 AM

In a MERGE without a BY, the data step reads obs 1 from the first dataset, then obs 1 from the second, then outputs. Then it reads obs 2 from the first, obs 2 from the second, and outputs. Repeat until one dataset is finished, after which the other is read until it runs out of obs, at which point the step terminates.

In the example, id value 1 from left is overwritten with 1 from right, then 3 from left is overwritten with 2 from right, and finally 5 is overwritten with 4. Since x and y are separate variables, their values are preserved.

Since both datasets have three observations, the step also terminates after three iterations.

Add an additional observation to one of the datasets to see how the step works once it runs out of observations in the other dataset.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Quentin · Posted 04-18-2022 09:00 AM

@David_Billa wrote:

@Quentin @Kurt_Bremser I don't understand the result of WANT4. What is the meaning of merge without BY statement? Why the id value 3 not appearing in the result?

A merge without a BY statement is what I call a 'smush'. The MERGE statement will read the first record from each dataset, and smush them together, then will read the second record from each datataset, and smush them together. There is no by variable to join on. You could think of it like an outer join by row number. It's often a mistake to leave out the BY statement, and there is a system option which allows you to tell SAS to treat this as an error: mergenoby=ERROR.

Note also that if two datasets read on the MERGE statement have the same variable name, and the variable is not listed on the BY statement, they will collide. The value read from the second dataset will overwrite the value read from the first dataset. So in this case, the output dataset WANT4 has ID values id=1, id=2, and id=4 because the values of ID from RIGHT have overwritten the values from LEFT. There is a system option msglevel=i which will write a note in the log when there is a variable name collision.

This code:

data left ;
  input id x ;
  cards ;
  1 10
  3 30
  5 50
;

data right ;
  input id y ;
  cards ;
  1 10
  2 20
  4 40
;     

options msglevel=i ;
data want4 ;
  merge 
   left  (in=a)
   right (in=b)
  ;
  *no by statement;

  put (id x y a b)(=) ;
run ;

Results in the log:

17   options msglevel=i ;
18   data want4 ;
19     merge
20      left  (in=a)
21      right (in=b)
22     ;
23     *no by statement;
24
25     put (id x y a b)(=) ;
26   run ;

INFO: The variable id on data set WORK.LEFT will be overwritten by data set WORK.RIGHT.
id=1 x=10 y=10 a=1 b=1
id=2 x=30 y=20 a=1 b=1
id=4 x=50 y=40 a=1 b=1
NOTE: There were 3 observations read from the data set WORK.LEFT.
NOTE: There were 3 observations read from the data set WORK.RIGHT.
NOTE: The data set WORK.WANT4 has 3 observations and 3 variables.

HTH,

--Q.

The Boston Area SAS Users Group is hosting free webinars!

Register now at https://www.basug.org/events.

David_Billa · Posted 04-18-2022 05:59 AM

@Kurt_Bremser I'm still unable to understand the meaning of 'IF A THEN DO' in this code. In my understanding A is data from WORK.MFG. If yes, then how can I apply filter like 'IF A THEN DO'?

DATA WORK.MFG;
     MERGE WORK.MFG (in=A) WORK.PO (in=B);
     BY PLANT PO;
     IF A THEN DO;
      VALUE=1;
     END;
RUN;

Tom · Posted 04-18-2022 08:28 AM

@David_Billa wrote:

@Kurt_Bremser I'm still unable to understand the meaning of 'IF A THEN DO' in this code. In my understanding A is data from WORK.MFG. If yes, then how can I apply filter like 'IF A THEN DO'?
DATA WORK.MFG;
     MERGE WORK.MFG (in=A) WORK.PO (in=B);
     BY PLANT PO;
     IF A THEN DO;
      VALUE=1;
     END;
RUN;

The variable A is created by the data step because it is the name used in the IN= dataset option on the dataset MFG. The value of A will be TRUE whenever the MFG is contributing to the current observation.

If you want to see the value it has then assign its value to another variable that is not going to be dropped from the output dataset. Then you print the data or browse the output dataset and see how it works.

Note that using A nd B for those variables is probably not the best naming convention to use. Personally I use IN as a prefix on such variables to make it clearer what they are.

What do you mean by FILTER? Do you want to subset the observations that are going to be written?

You can DELETE observations you don't want. You can selectively only OUTPUT the observations you want.

Try these steps:

data want1;
  merge mfg (in=in_mfg) po (in=in_po);
  by plant po;
  if in_mfg;
run;

data want2;
  merge mfg (in=in_mfg) po (in=in_po);
  by plant po;
  if in_mfg then output;
run;

data want3;
  merge mfg (in=in_mfg) po (in=in_po);
  by plant po;
  if not in_mfg then delete;
run;

Kurt_Bremser · Posted 04-12-2022 06:35 AM

And just so we get your wording straight: IN in your question is not an operator, it is a dataset option.

if x in (1,2,3)
/*   ^ operator */
merge
  a (in=a)
/*   ^ dataset option */

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away