BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
chaitalip
Fluorite | Level 6

Hi,

I am using the merge statement as follows(In SAS University edition):

 

 libname Temp '/folders/myfolders';

data Temp.merge_calls_2;
    merge Temp.trans_2 (in=jkl) Temp.trans_3 (in=iop);
    by col1 col2;

    if jkl and iop;
run;

proc print data=Temp.merge_calls_2;
    title Merge_Calls2;
run;

 

 

In some cases this logic provides me the output similar to SQL Inner join where as, sometimes the output is different.

Can some one pl. explain the exact working of the above merge operation and also a appropriate SQL query to get the same output .

 

Thanks In Advance.

CP.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

In a many-to-many merge, SQL will create a cartesian join, while the datastep just goes through them sequentially.

eg dataset X has 3 obs for A = 1, while dataset Y has 2 obs for A = 1:

proc sql;
create table Z as 
select X.*
from X inner join Y
on X.A = Y.A
;
quit;

willl create dataset Z with 6 observations; all individual obs from X are joined with all individual obs from Y.

data Z;
merge
  X (in=in_x)
  Y (in=in_y)
;
by A;
if in_x and in_y;
run;

will create dataset Z with 3 observations.

With the data step, Z will have the following observations for A = 1:

1st: data from first obs from X and first obs from Y

2nd: data from second obs from X and second obs from Y

3rd: data from third obs from X and second obs from Y

Once the last record from a BY group is read from one of the participating datasets, its contents will stay in the PDV until no more records for the same BY group come from the other datasets.

View solution in original post

9 REPLIES 9
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Yes, inner join is the nearest to that, see below graphic of the various joins.  Obviously the output depends on you data - i.e. multiple entries on either side and such like.  The datastep works by assigning a variable to each dataset which indicates which dataset it comes from, think of it like a new variable.  The if statement just checks that the merged record is present in both datasets.

Am not going over the full process, that is where the manual comes in:

http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001318494.htm

Or training:

http://www.ats.ucla.edu/stat/sas/modules/merge.htm

 

Capture.PNG

chaitalip
Fluorite | Level 6

Hi,

the above mentioned link is good but i am not able to get the exact working of  

if jkl and iop

 

From the above posted code by me.

 

Thanks in advance.

CP.

Kurt_Bremser
Super User

jkl and iop are two temporary variables (created throught the dataset options in the merge statement and not included in the output dataset(s)) that are set to 1 if a record from the respective dataset is present, and zero otherwise.

if jkl and iop;

is a so-called "subsetting if" and therefore means that if a record (for the current "by" value) is missing from either dataset, the current iteration of the data step is terminated and no output is written.

Reeza
Super User

What type,of merge are you trying to do, one to one, one to many, or many to many?

Typically these are what cause the differences between a data step and proc SQL and mess up how you think the IN and IF should work. 

chaitalip
Fluorite | Level 6

Hi,

   I think its Many-to-many merging because there are duplicates in both data set.

Kurt_Bremser
Super User

In a many-to-many merge, SQL will create a cartesian join, while the datastep just goes through them sequentially.

eg dataset X has 3 obs for A = 1, while dataset Y has 2 obs for A = 1:

proc sql;
create table Z as 
select X.*
from X inner join Y
on X.A = Y.A
;
quit;

willl create dataset Z with 6 observations; all individual obs from X are joined with all individual obs from Y.

data Z;
merge
  X (in=in_x)
  Y (in=in_y)
;
by A;
if in_x and in_y;
run;

will create dataset Z with 3 observations.

With the data step, Z will have the following observations for A = 1:

1st: data from first obs from X and first obs from Y

2nd: data from second obs from X and second obs from Y

3rd: data from third obs from X and second obs from Y

Once the last record from a BY group is read from one of the participating datasets, its contents will stay in the PDV until no more records for the same BY group come from the other datasets.

chaitalip
Fluorite | Level 6

Thank you for your reply,

Just one more query. Can you provide equivalent SQL query to get the same output.

 

Thanks in advance.

CP.

Kurt_Bremser
Super User

From where does your obsession with SQL come from? If the data step does what you want, stay with it.

(Answer to your question: none, as there isn't one. At least one that makes sense. SQL behaves differently, period)

Reeza
Super User

A many to many merge in a data step does not give you the same results as using a SQL join. 

 

Usually I prefer a SQL step for these joins as the results are more intuitive. 

 

Here's a SAS note regarding this:

 

http://support.sas.com/kb/24/752.html

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 38884 views
  • 8 likes
  • 4 in conversation