Solved: Re: PROC SQL vs. Data Step Merge not getting same results

JediApprentice · Posted 11-04-2016 11:18 AM

I'm having trouble understanding why my proc sql and data step merge are not getting the same results. It seems to be off by one observation (I'm trying to convert my proc sql into a data step). Here is the full code:

data mydata;
  infile datalines;
  input list $ NDC discount;
datalines;
Nov 11 50
Dec 12 20
Nov 11 50
May 12 .
Mar 11 30
;
run;

data mydata2;
  infile datalines;
  input list $ NDC discount;
datalines;
Jan 11 40
Dec 12 20
Mar 10 50
May 12 .
Mar 11 30
;
run;

proc sql;
  create table x1 as
  select *
  from mydata t1 left join mydata2 t2 on(t1.list=t2.list);
quit;

proc sort data=mydata;
  by list;
run;

proc sort data=mydata2;
  by list;
run;

data x2;
  merge mydata(in=t1) mydata2;
  by list;
  if t1;
run;

Results of x1:

Results of x2:

The second observation from x1 is different from that of x2. Why is this? It must be something extremely simple - forgive me if I'm wasting your time.

Tom · Posted 11-04-2016 01:00 PM

Because your variable names are the same in both data set. With a MERGE the last value read is the one that you will see in the output. WIth SELECT * in SQL the first instance of that variable is the one that is output if there is another variable with the same name then it is ignored.

Try this step and see if it doesn't match the SQL result.

data x3;
  merge mydata2 mydata(in=t1);
  by list;
  if t1;
run;

Since MYDATA is listed last the values of NDC and DISCOUNT from the MYDATA records will overwrite the values from the MYDATA2 records.

View solution in original post

JediApprentice · Posted 11-04-2016 12:19 PM

In other words, why am I getting that observation from mydata2 when I specified "if t1".

Astounding · Posted 11-04-2016 12:50 PM

Since you are merging by LIST, IF T1 means "did this LIST value appear in T1?" If it did, the results of the merge appear in the final data set.

One of the issues involved in getting different results is the meaning of SELECT * from MYDATA. This brings in data values from MYDATA only, and brings in nothing from MYDATA2. The only way MYDATA2 gets used is to compare its LIST values to those in MYDATA. Which observation looks to you like it comes from MYDATA2?

JediApprentice · Posted 11-04-2016 05:00 PM

@Astounding "One of the issues involved in getting different results is the meaning of SELECT * from MYDATA. This brings in data values from MYDATA only, and brings in nothing from MYDATA2."

Then how come when I do this and select * in the proc sql at the end, all values from both datasets get brought in?

data mydata;
  infile datalines;
  input ID $ Country $ Code;
  datalines;
A USA 228
B BGM 210
C CAN 200
;
run;

data mydata2;
  infile datalines;
  input ID $ State $ age;
  datalines;
A WI 15
B CA 12
C MN 20
;
run;

proc sql;
  create table want as
  select *
  from have t1 left join have2 t2 on(t1.ID=t2.ID);
quit;

Is it because I'm now using different variables?

Tom · Posted 11-04-2016 01:00 PM

Because your variable names are the same in both data set. With a MERGE the last value read is the one that you will see in the output. WIth SELECT * in SQL the first instance of that variable is the one that is output if there is another variable with the same name then it is ignored.

Try this step and see if it doesn't match the SQL result.

data x3;
  merge mydata2 mydata(in=t1);
  by list;
  if t1;
run;

Since MYDATA is listed last the values of NDC and DISCOUNT from the MYDATA records will overwrite the values from the MYDATA2 records.

Kurt_Bremser · Posted 11-04-2016 01:16 PM

The log will give you answers. Merging or joining datasets that contain the same variables always gives peculiar results (accompanied by log messages in the case of SQL when using select *) and is generally a BAD IDEA.

While SQL only takes variables from one table, the data step overwrites values in the sequence in which observations are read from the contributing tables.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

JediApprentice · Posted 11-04-2016 05:09 PM

@Kurt_Bremser

But SQL does take variables from both tables... I'm not sure I understand. If you have two tables - lets say have multiple variables but they share one variable i ncommon and you do a left join on that, you will get all the other variables from the datasets when you use a select *

Tom · Posted 11-04-2016 08:05 PM

Your problem with SQL is that you cannot create a SAS dataset with two variables that have the same name.

Try this code to see what is happening.

proc sql ;
 select a.name,b.name
 from sashelp.class a
 inner join sashelp.class b
 on a.name = b.name
 ;
 create table test as
 select a.name,b.name
 from sashelp.class a
 inner join sashelp.class b
 on a.name = b.name
 ;
 select * from test ;
quit;

Kurt_Bremser · Posted 11-05-2016 05:33 AM

@JediApprentice wrote:

@Kurt_Bremser

But SQL does take variables from both tables...

ONLY when they have different names.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Classroom Training Available!