Hi,
I am using the merge statement as follows(In SAS University edition):
libname Temp '/folders/myfolders';
data Temp.merge_calls_2;
merge Temp.trans_2 (in=jkl) Temp.trans_3 (in=iop);
by col1 col2;
if jkl and iop;
run;
proc print data=Temp.merge_calls_2;
title Merge_Calls2;
run;
In some cases this logic provides me the output similar to SQL Inner join where as, sometimes the output is different.
Can some one pl. explain the exact working of the above merge operation and also a appropriate SQL query to get the same output .
Thanks In Advance.
CP.
In a many-to-many merge, SQL will create a cartesian join, while the datastep just goes through them sequentially.
eg dataset X has 3 obs for A = 1, while dataset Y has 2 obs for A = 1:
proc sql;
create table Z as
select X.*
from X inner join Y
on X.A = Y.A
;
quit;
willl create dataset Z with 6 observations; all individual obs from X are joined with all individual obs from Y.
data Z;
merge
X (in=in_x)
Y (in=in_y)
;
by A;
if in_x and in_y;
run;
will create dataset Z with 3 observations.
With the data step, Z will have the following observations for A = 1:
1st: data from first obs from X and first obs from Y
2nd: data from second obs from X and second obs from Y
3rd: data from third obs from X and second obs from Y
Once the last record from a BY group is read from one of the participating datasets, its contents will stay in the PDV until no more records for the same BY group come from the other datasets.
Yes, inner join is the nearest to that, see below graphic of the various joins. Obviously the output depends on you data - i.e. multiple entries on either side and such like. The datastep works by assigning a variable to each dataset which indicates which dataset it comes from, think of it like a new variable. The if statement just checks that the merged record is present in both datasets.
Am not going over the full process, that is where the manual comes in:
http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001318494.htm
Or training:
http://www.ats.ucla.edu/stat/sas/modules/merge.htm
Hi,
the above mentioned link is good but i am not able to get the exact working of
if jkl and iop
From the above posted code by me.
Thanks in advance.
CP.
jkl and iop are two temporary variables (created throught the dataset options in the merge statement and not included in the output dataset(s)) that are set to 1 if a record from the respective dataset is present, and zero otherwise.
if jkl and iop;
is a so-called "subsetting if" and therefore means that if a record (for the current "by" value) is missing from either dataset, the current iteration of the data step is terminated and no output is written.
What type,of merge are you trying to do, one to one, one to many, or many to many?
Typically these are what cause the differences between a data step and proc SQL and mess up how you think the IN and IF should work.
Hi,
I think its Many-to-many merging because there are duplicates in both data set.
In a many-to-many merge, SQL will create a cartesian join, while the datastep just goes through them sequentially.
eg dataset X has 3 obs for A = 1, while dataset Y has 2 obs for A = 1:
proc sql;
create table Z as
select X.*
from X inner join Y
on X.A = Y.A
;
quit;
willl create dataset Z with 6 observations; all individual obs from X are joined with all individual obs from Y.
data Z;
merge
X (in=in_x)
Y (in=in_y)
;
by A;
if in_x and in_y;
run;
will create dataset Z with 3 observations.
With the data step, Z will have the following observations for A = 1:
1st: data from first obs from X and first obs from Y
2nd: data from second obs from X and second obs from Y
3rd: data from third obs from X and second obs from Y
Once the last record from a BY group is read from one of the participating datasets, its contents will stay in the PDV until no more records for the same BY group come from the other datasets.
Thank you for your reply,
Just one more query. Can you provide equivalent SQL query to get the same output.
Thanks in advance.
CP.
From where does your obsession with SQL come from? If the data step does what you want, stay with it.
(Answer to your question: none, as there isn't one. At least one that makes sense. SQL behaves differently, period)
A many to many merge in a data step does not give you the same results as using a SQL join.
Usually I prefer a SQL step for these joins as the results are more intuitive.
Here's a SAS note regarding this:
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.