In the following code, my understanding it's doing a left join. In the result, we will get results from left table for all matching values of PLANT and PO between the two tables. Is my understanding right?
DATA WORK.MFG; MERGE WORK.MFG (in=A) WORK.PO (in=B); BY PLANT PO; IF A=1 AND B=0 THEN OUTPUT; RUN;
Before discussing the IN= parameters - the MERGE statement can do a left join only if the data have these conditions
If you have any instances of many-to-many then merge will not replicate a left join.
Of course, both datasets must be sorted by PLANT PO.
The
IF A=1 and B=0 then output;
statement tells the data step to output only those instances in which there is a WORK.MFG observation, but no matching WORK.PO observation.
To replicate a left join, change the IF to:
IF A=1 then output;
keeping all instances in which there is an observation in WORK.MFG, regardless of the status of matching observations in WORK.PO.
Maybe if you run code like this:
DATA WORK.New; MERGE WORK.MFG (in=A) WORK.PO (in=B); BY PLANT PO; Left=A; Right=B; RUN;
And examine the values of Left and Right.
NOT the same as a left join when there are multiples of the by values in both datasets (many-to-many).
Not the same as a left join on single values. Left join (unless you add a where clause and you didn't specify the Left join you think might be the same) keeps ALL records from the Left set. The IF statement excludes any that do not have a match in the right set set.
When you replace a data set, such as using the same data set on the Data and Merge (or Set) statement then you have destroyed the original work.mfg and the results are going to be hard to actual follow. AND if you run the same code repeatedly you will find that MERGE can behave differently when some variables other than the BY variables exist in both data sets.
Do not shout at the poor SAS interpreter, it has done nothing to deserve this. And it makes code less readable.
Your code will only keep observations from MFG that do not have a match in P0.
It can be written more concisely in this way:
data mfg_merge;
merge
mfg (in=a)
p0 (in=b)
;
by plant p0;
if a and not b;
run;
The IN= variables are Boolean variables and can only have the values 0 (false) or 1 (true), so you can use them directly as conditions.
IF <condition> THEN OUTPUT; is equivalent to a Subsetting IF if it appears only once in the step, so it makes for better reading to use the Subsetting IF.
Overwriting a dataset in one step when the contents change is dangerous; if something fails, you need to go back further up in your code to recreate the dataset, wasting time during development. Additional WORK datasets will vanish on their own anyway when the session terminates, and if you actually run out of space, use a PROC DELETE to get rid of unneeded datasets then.
Splitting the MERGE over a separate line for each dataset makes it easier to move such lines around if you have to change the logic of several steps; keeping code elements on separate lines makes working with code easier (hint out of 20+ years of SAS coding).
@Kurt_Bremser @ballardw What will be the meaning of IF A THEN DO; in the program below? It is similar to IF A=1?
DATA WORK.MFG; MERGE WORK.MFG (in=A) WORK.PO (in=B); BY PLANT PO; IF A THEN DO; VALUE=1; RUN;
@David_Billa wrote:
@Kurt_Bremser @ballardw What will be the meaning of IF A THEN DO; in the program below? It is similar to IF A=1?
DATA WORK.MFG; MERGE WORK.MFG (in=A) WORK.PO (in=B); BY PLANT PO; IF A THEN DO; VALUE=1; RUN;
There's no meaning, because the code throws a syntax error and does nothing. See DO Statement for details.
Any numeric value that is not zero or missing is considered true, so
if a then ....
is equivalent to
if a not in (.,0) then ...
A JOIN in SQL and a MERGE in a data step can only create the same results if you do NOT have a many-to-many relationship. If you have X observations in one dataset and Y observations in the other, you'll get max(X,Y) in a MERGE, but X * Y in a JOIN.
If you have one-to-one, one-to-many or many-to one, a MERGE with BY but no subsetting IF is similar to a FULL JOIN, but you must always beware of variables with identical names. SQL warns you, but the MERGE just accepts it and the resulting data can be quite unpredictable for someone who has yet to make the "deep dive" into the mechanics of the DATA step.
Run this and study the resulting dataset:
data a;
input id value;
datalines;
1 2
2 4
2 5
3 6
3 7
5 10
;
data b;
input id value;
datalines;
1 3
2 2
3 9
4 12
6 18
;
data ab;
merge
a
b
;
by id;
run;
These are good questions, but would encourage you to make a couple tiny datasets and play with it. You can learn a lot from playing with datasets with 3-5 records.
For example, code like:
data left ;
input id x ;
cards ;
1 10
3 30
5 50
;
data right ;
input id y ;
cards ;
1 10
2 20
4 40
;
data want1 ;
merge
left (in=l)
right (in=r)
;
by id ;
inLeft=l;
inRight=r;
run ;
data want2 ;
merge
left (in=l)
right (in=r)
;
by id ;
if l=1 ;
run ;
data want3 ;
merge
left (in=l)
right (in=r)
;
by id ;
if l=1 and r=1 ;
run ;
data want4 ;
merge
left
right
;
*no by statement;
run ;
If you know how to use the data step debugger, it can be a great help in understanding how the DATA step works. It takes a while to understand how the MERGE statement works. You need a good understanding of the DATA step itself. If you play around with simple code like that, and then come up with an example you don't understand, you can post a fresh questions with sample data and sample code. Also, check lexjansen.com for SAS user group papers about MERGE.
Or, if you know SQL, and would rather think in terms of combining data via SQL joins, you could skip learning merge and decide to use PROC SQL to combine data. : )
@Quentin @Kurt_Bremser I don't understand the result of WANT4. What is the meaning of merge without BY statement? Why the id value 3 not appearing in the result?
In a MERGE without a BY, the data step reads obs 1 from the first dataset, then obs 1 from the second, then outputs. Then it reads obs 2 from the first, obs 2 from the second, and outputs. Repeat until one dataset is finished, after which the other is read until it runs out of obs, at which point the step terminates.
In the example, id value 1 from left is overwritten with 1 from right, then 3 from left is overwritten with 2 from right, and finally 5 is overwritten with 4. Since x and y are separate variables, their values are preserved.
Since both datasets have three observations, the step also terminates after three iterations.
Add an additional observation to one of the datasets to see how the step works once it runs out of observations in the other dataset.
@David_Billa wrote:
@Quentin @Kurt_Bremser I don't understand the result of WANT4. What is the meaning of merge without BY statement? Why the id value 3 not appearing in the result?
A merge without a BY statement is what I call a 'smush'. The MERGE statement will read the first record from each dataset, and smush them together, then will read the second record from each datataset, and smush them together. There is no by variable to join on. You could think of it like an outer join by row number. It's often a mistake to leave out the BY statement, and there is a system option which allows you to tell SAS to treat this as an error: mergenoby=ERROR.
Note also that if two datasets read on the MERGE statement have the same variable name, and the variable is not listed on the BY statement, they will collide. The value read from the second dataset will overwrite the value read from the first dataset. So in this case, the output dataset WANT4 has ID values id=1, id=2, and id=4 because the values of ID from RIGHT have overwritten the values from LEFT. There is a system option msglevel=i which will write a note in the log when there is a variable name collision.
This code:
data left ;
input id x ;
cards ;
1 10
3 30
5 50
;
data right ;
input id y ;
cards ;
1 10
2 20
4 40
;
options msglevel=i ;
data want4 ;
merge
left (in=a)
right (in=b)
;
*no by statement;
put (id x y a b)(=) ;
run ;
Results in the log:
17 options msglevel=i ; 18 data want4 ; 19 merge 20 left (in=a) 21 right (in=b) 22 ; 23 *no by statement; 24 25 put (id x y a b)(=) ; 26 run ; INFO: The variable id on data set WORK.LEFT will be overwritten by data set WORK.RIGHT. id=1 x=10 y=10 a=1 b=1 id=2 x=30 y=20 a=1 b=1 id=4 x=50 y=40 a=1 b=1 NOTE: There were 3 observations read from the data set WORK.LEFT. NOTE: There were 3 observations read from the data set WORK.RIGHT. NOTE: The data set WORK.WANT4 has 3 observations and 3 variables.
HTH,
--Q.
@Kurt_Bremser I'm still unable to understand the meaning of 'IF A THEN DO' in this code. In my understanding A is data from WORK.MFG. If yes, then how can I apply filter like 'IF A THEN DO'?
DATA WORK.MFG; MERGE WORK.MFG (in=A) WORK.PO (in=B); BY PLANT PO; IF A THEN DO; VALUE=1;
END; RUN;
@David_Billa wrote:
@Kurt_Bremser I'm still unable to understand the meaning of 'IF A THEN DO' in this code. In my understanding A is data from WORK.MFG. If yes, then how can I apply filter like 'IF A THEN DO'?
DATA WORK.MFG; MERGE WORK.MFG (in=A) WORK.PO (in=B); BY PLANT PO; IF A THEN DO; VALUE=1;
END; RUN;
The variable A is created by the data step because it is the name used in the IN= dataset option on the dataset MFG. The value of A will be TRUE whenever the MFG is contributing to the current observation.
If you want to see the value it has then assign its value to another variable that is not going to be dropped from the output dataset. Then you print the data or browse the output dataset and see how it works.
Note that using A nd B for those variables is probably not the best naming convention to use. Personally I use IN as a prefix on such variables to make it clearer what they are.
What do you mean by FILTER? Do you want to subset the observations that are going to be written?
You can DELETE observations you don't want. You can selectively only OUTPUT the observations you want.
Try these steps:
data want1;
merge mfg (in=in_mfg) po (in=in_po);
by plant po;
if in_mfg;
run;
data want2;
merge mfg (in=in_mfg) po (in=in_po);
by plant po;
if in_mfg then output;
run;
data want3;
merge mfg (in=in_mfg) po (in=in_po);
by plant po;
if not in_mfg then delete;
run;
And just so we get your wording straight: IN in your question is not an operator, it is a dataset option.
if x in (1,2,3)
/* ^ operator */
merge
a (in=a)
/* ^ dataset option */
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.