- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In the following code, my understanding it's doing a left join. In the result, we will get results from left table for all matching values of PLANT and PO between the two tables. Is my understanding right?
DATA WORK.MFG; MERGE WORK.MFG (in=A) WORK.PO (in=B); BY PLANT PO; IF A=1 AND B=0 THEN OUTPUT; RUN;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Before discussing the IN= parameters - the MERGE statement can do a left join only if the data have these conditions
- Whenever WORK.MFG has one row, you can have any number of matching rows in WORK.PO (including zero such rows).
or - Whenever WORK.MFG has multiple rows, WORK.PO can have only zero or one matching row.
If you have any instances of many-to-many then merge will not replicate a left join.
Of course, both datasets must be sorted by PLANT PO.
The
IF A=1 and B=0 then output;
statement tells the data step to output only those instances in which there is a WORK.MFG observation, but no matching WORK.PO observation.
To replicate a left join, change the IF to:
IF A=1 then output;
keeping all instances in which there is an observation in WORK.MFG, regardless of the status of matching observations in WORK.PO.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Maybe if you run code like this:
DATA WORK.New; MERGE WORK.MFG (in=A) WORK.PO (in=B); BY PLANT PO; Left=A; Right=B; RUN;
And examine the values of Left and Right.
NOT the same as a left join when there are multiples of the by values in both datasets (many-to-many).
Not the same as a left join on single values. Left join (unless you add a where clause and you didn't specify the Left join you think might be the same) keeps ALL records from the Left set. The IF statement excludes any that do not have a match in the right set set.
When you replace a data set, such as using the same data set on the Data and Merge (or Set) statement then you have destroyed the original work.mfg and the results are going to be hard to actual follow. AND if you run the same code repeatedly you will find that MERGE can behave differently when some variables other than the BY variables exist in both data sets.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Do not shout at the poor SAS interpreter, it has done nothing to deserve this. And it makes code less readable.
Your code will only keep observations from MFG that do not have a match in P0.
It can be written more concisely in this way:
data mfg_merge;
merge
mfg (in=a)
p0 (in=b)
;
by plant p0;
if a and not b;
run;
The IN= variables are Boolean variables and can only have the values 0 (false) or 1 (true), so you can use them directly as conditions.
IF <condition> THEN OUTPUT; is equivalent to a Subsetting IF if it appears only once in the step, so it makes for better reading to use the Subsetting IF.
Overwriting a dataset in one step when the contents change is dangerous; if something fails, you need to go back further up in your code to recreate the dataset, wasting time during development. Additional WORK datasets will vanish on their own anyway when the session terminates, and if you actually run out of space, use a PROC DELETE to get rid of unneeded datasets then.
Splitting the MERGE over a separate line for each dataset makes it easier to move such lines around if you have to change the logic of several steps; keeping code elements on separate lines makes working with code easier (hint out of 20+ years of SAS coding).
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Kurt_Bremser @ballardw What will be the meaning of IF A THEN DO; in the program below? It is similar to IF A=1?
DATA WORK.MFG; MERGE WORK.MFG (in=A) WORK.PO (in=B); BY PLANT PO; IF A THEN DO; VALUE=1; RUN;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@David_Billa wrote:
@Kurt_Bremser @ballardw What will be the meaning of IF A THEN DO; in the program below? It is similar to IF A=1?
DATA WORK.MFG; MERGE WORK.MFG (in=A) WORK.PO (in=B); BY PLANT PO; IF A THEN DO; VALUE=1; RUN;
There's no meaning, because the code throws a syntax error and does nothing. See DO Statement for details.
Any numeric value that is not zero or missing is considered true, so
if a then ....
is equivalent to
if a not in (.,0) then ...
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
error. I just wanted to know the meaning of 'if A then...'.
Also what will happen if we mention merge with by variables and without any
if a or b clause? Will it be inner join?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
A JOIN in SQL and a MERGE in a data step can only create the same results if you do NOT have a many-to-many relationship. If you have X observations in one dataset and Y observations in the other, you'll get max(X,Y) in a MERGE, but X * Y in a JOIN.
If you have one-to-one, one-to-many or many-to one, a MERGE with BY but no subsetting IF is similar to a FULL JOIN, but you must always beware of variables with identical names. SQL warns you, but the MERGE just accepts it and the resulting data can be quite unpredictable for someone who has yet to make the "deep dive" into the mechanics of the DATA step.
Run this and study the resulting dataset:
data a;
input id value;
datalines;
1 2
2 4
2 5
3 6
3 7
5 10
;
data b;
input id value;
datalines;
1 3
2 2
3 9
4 12
6 18
;
data ab;
merge
a
b
;
by id;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
These are good questions, but would encourage you to make a couple tiny datasets and play with it. You can learn a lot from playing with datasets with 3-5 records.
For example, code like:
data left ;
input id x ;
cards ;
1 10
3 30
5 50
;
data right ;
input id y ;
cards ;
1 10
2 20
4 40
;
data want1 ;
merge
left (in=l)
right (in=r)
;
by id ; inLeft=l; inRight=r;
run ;
data want2 ;
merge
left (in=l)
right (in=r)
;
by id ;
if l=1 ;
run ;
data want3 ;
merge
left (in=l)
right (in=r)
;
by id ;
if l=1 and r=1 ;
run ;
data want4 ;
merge
left
right
;
*no by statement;
run ;
If you know how to use the data step debugger, it can be a great help in understanding how the DATA step works. It takes a while to understand how the MERGE statement works. You need a good understanding of the DATA step itself. If you play around with simple code like that, and then come up with an example you don't understand, you can post a fresh questions with sample data and sample code. Also, check lexjansen.com for SAS user group papers about MERGE.
Or, if you know SQL, and would rather think in terms of combining data via SQL joins, you could skip learning merge and decide to use PROC SQL to combine data. : )
Next up: Troy Martin Hughes presents Calling Open-Source Python Functions within SAS PROC FCMP: A Google Maps API Geocoding Adventure on Wednesday April 23.
Register now at https://www.basug.org/events.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Quentin @Kurt_Bremser I don't understand the result of WANT4. What is the meaning of merge without BY statement? Why the id value 3 not appearing in the result?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In a MERGE without a BY, the data step reads obs 1 from the first dataset, then obs 1 from the second, then outputs. Then it reads obs 2 from the first, obs 2 from the second, and outputs. Repeat until one dataset is finished, after which the other is read until it runs out of obs, at which point the step terminates.
In the example, id value 1 from left is overwritten with 1 from right, then 3 from left is overwritten with 2 from right, and finally 5 is overwritten with 4. Since x and y are separate variables, their values are preserved.
Since both datasets have three observations, the step also terminates after three iterations.
Add an additional observation to one of the datasets to see how the step works once it runs out of observations in the other dataset.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@David_Billa wrote:
@Quentin @Kurt_Bremser I don't understand the result of WANT4. What is the meaning of merge without BY statement? Why the id value 3 not appearing in the result?
A merge without a BY statement is what I call a 'smush'. The MERGE statement will read the first record from each dataset, and smush them together, then will read the second record from each datataset, and smush them together. There is no by variable to join on. You could think of it like an outer join by row number. It's often a mistake to leave out the BY statement, and there is a system option which allows you to tell SAS to treat this as an error: mergenoby=ERROR.
Note also that if two datasets read on the MERGE statement have the same variable name, and the variable is not listed on the BY statement, they will collide. The value read from the second dataset will overwrite the value read from the first dataset. So in this case, the output dataset WANT4 has ID values id=1, id=2, and id=4 because the values of ID from RIGHT have overwritten the values from LEFT. There is a system option msglevel=i which will write a note in the log when there is a variable name collision.
This code:
data left ;
input id x ;
cards ;
1 10
3 30
5 50
;
data right ;
input id y ;
cards ;
1 10
2 20
4 40
;
options msglevel=i ;
data want4 ;
merge
left (in=a)
right (in=b)
;
*no by statement;
put (id x y a b)(=) ;
run ;
Results in the log:
17 options msglevel=i ; 18 data want4 ; 19 merge 20 left (in=a) 21 right (in=b) 22 ; 23 *no by statement; 24 25 put (id x y a b)(=) ; 26 run ; INFO: The variable id on data set WORK.LEFT will be overwritten by data set WORK.RIGHT. id=1 x=10 y=10 a=1 b=1 id=2 x=30 y=20 a=1 b=1 id=4 x=50 y=40 a=1 b=1 NOTE: There were 3 observations read from the data set WORK.LEFT. NOTE: There were 3 observations read from the data set WORK.RIGHT. NOTE: The data set WORK.WANT4 has 3 observations and 3 variables.
HTH,
--Q.
Next up: Troy Martin Hughes presents Calling Open-Source Python Functions within SAS PROC FCMP: A Google Maps API Geocoding Adventure on Wednesday April 23.
Register now at https://www.basug.org/events.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Kurt_Bremser I'm still unable to understand the meaning of 'IF A THEN DO' in this code. In my understanding A is data from WORK.MFG. If yes, then how can I apply filter like 'IF A THEN DO'?
DATA WORK.MFG; MERGE WORK.MFG (in=A) WORK.PO (in=B); BY PLANT PO; IF A THEN DO; VALUE=1;
END; RUN;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@David_Billa wrote:
@Kurt_Bremser I'm still unable to understand the meaning of 'IF A THEN DO' in this code. In my understanding A is data from WORK.MFG. If yes, then how can I apply filter like 'IF A THEN DO'?
DATA WORK.MFG; MERGE WORK.MFG (in=A) WORK.PO (in=B); BY PLANT PO; IF A THEN DO; VALUE=1;
END; RUN;
The variable A is created by the data step because it is the name used in the IN= dataset option on the dataset MFG. The value of A will be TRUE whenever the MFG is contributing to the current observation.
If you want to see the value it has then assign its value to another variable that is not going to be dropped from the output dataset. Then you print the data or browse the output dataset and see how it works.
Note that using A nd B for those variables is probably not the best naming convention to use. Personally I use IN as a prefix on such variables to make it clearer what they are.
What do you mean by FILTER? Do you want to subset the observations that are going to be written?
You can DELETE observations you don't want. You can selectively only OUTPUT the observations you want.
Try these steps:
data want1;
merge mfg (in=in_mfg) po (in=in_po);
by plant po;
if in_mfg;
run;
data want2;
merge mfg (in=in_mfg) po (in=in_po);
by plant po;
if in_mfg then output;
run;
data want3;
merge mfg (in=in_mfg) po (in=in_po);
by plant po;
if not in_mfg then delete;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
And just so we get your wording straight: IN in your question is not an operator, it is a dataset option.
if x in (1,2,3)
/* ^ operator */
merge
a (in=a)
/* ^ dataset option */