Solved: Re: Exact woking of Merge operation (IF A AND B)?

chaitalip · Posted 06-24-2016 04:55 AM

Hi,

I am using the merge statement as follows(In SAS University edition):

libname Temp '/folders/myfolders';

data Temp.merge_calls_2;
   merge Temp.trans_2 (in=jkl) Temp.trans_3 (in=iop);
   by col1 col2;

   if jkl and iop;
run;

proc print data=Temp.merge_calls_2;
   title Merge_Calls2;
run;

In some cases this logic provides me the output similar to SQL Inner join where as, sometimes the output is different.

Can some one pl. explain the exact working of the above merge operation and also a appropriate SQL query to get the same output .

Thanks In Advance.

CP.

Kurt_Bremser · Posted 06-27-2016 02:12 AM

In a many-to-many merge, SQL will create a cartesian join, while the datastep just goes through them sequentially.

eg dataset X has 3 obs for A = 1, while dataset Y has 2 obs for A = 1:

proc sql;
create table Z as 
select X.*
from X inner join Y
on X.A = Y.A
;
quit;

willl create dataset Z with 6 observations; all individual obs from X are joined with all individual obs from Y.

data Z;
merge
  X (in=in_x)
  Y (in=in_y)
;
by A;
if in_x and in_y;
run;

will create dataset Z with 3 observations.

With the data step, Z will have the following observations for A = 1:

1st: data from first obs from X and first obs from Y

2nd: data from second obs from X and second obs from Y

3rd: data from third obs from X and second obs from Y

Once the last record from a BY group is read from one of the participating datasets, its contents will stay in the PDV until no more records for the same BY group come from the other datasets.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

RW9 · Posted 06-24-2016 05:20 AM

Yes, inner join is the nearest to that, see below graphic of the various joins. Obviously the output depends on you data - i.e. multiple entries on either side and such like. The datastep works by assigning a variable to each dataset which indicates which dataset it comes from, think of it like a new variable. The if statement just checks that the merged record is present in both datasets.

Am not going over the full process, that is where the manual comes in:

http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001318494.htm

Or training:

http://www.ats.ucla.edu/stat/sas/modules/merge.htm

chaitalip · Posted 06-24-2016 05:56 AM

Hi,

the above mentioned link is good but i am not able to get the exact working of

if jkl and iop

From the above posted code by me.

Thanks in advance.

CP.

Kurt_Bremser · Posted 06-24-2016 06:22 AM

jkl and iop are two temporary variables (created throught the dataset options in the merge statement and not included in the output dataset(s)) that are set to 1 if a record from the respective dataset is present, and zero otherwise.

if jkl and iop;

is a so-called "subsetting if" and therefore means that if a record (for the current "by" value) is missing from either dataset, the current iteration of the data step is terminated and no output is written.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Reeza · Posted 06-24-2016 07:03 AM

What type,of merge are you trying to do, one to one, one to many, or many to many?

Typically these are what cause the differences between a data step and proc SQL and mess up how you think the IN and IF should work.

chaitalip · Posted 06-27-2016 01:52 AM

Hi,

I think its Many-to-many merging because there are duplicates in both data set.

Kurt_Bremser · Posted 06-27-2016 02:12 AM

In a many-to-many merge, SQL will create a cartesian join, while the datastep just goes through them sequentially.

eg dataset X has 3 obs for A = 1, while dataset Y has 2 obs for A = 1:

proc sql;
create table Z as 
select X.*
from X inner join Y
on X.A = Y.A
;
quit;

willl create dataset Z with 6 observations; all individual obs from X are joined with all individual obs from Y.

data Z;
merge
  X (in=in_x)
  Y (in=in_y)
;
by A;
if in_x and in_y;
run;

will create dataset Z with 3 observations.

With the data step, Z will have the following observations for A = 1:

1st: data from first obs from X and first obs from Y

2nd: data from second obs from X and second obs from Y

3rd: data from third obs from X and second obs from Y

Once the last record from a BY group is read from one of the participating datasets, its contents will stay in the PDV until no more records for the same BY group come from the other datasets.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

chaitalip · Posted 06-27-2016 07:56 AM

Thank you for your reply,

Just one more query. Can you provide equivalent SQL query to get the same output.

Thanks in advance.

CP.

Kurt_Bremser · Posted 06-27-2016 08:01 AM

From where does your obsession with SQL come from? If the data step does what you want, stay with it.

(Answer to your question: none, as there isn't one. At least one that makes sense. SQL behaves differently, period)

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Reeza · Posted 06-27-2016 02:13 AM

A many to many merge in a data step does not give you the same results as using a SQL join.

Usually I prefer a SQL step for these joins as the results are more intuitive.

Here's a SAS note regarding this:

http://support.sas.com/kb/24/752.html

Classroom Training Available!