BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

I'm having trouble understanding why my proc sql and data step merge are not getting the same results. It seems to be off by one observation (I'm trying to convert my proc sql into a data step). Here is the full code:

 

data mydata;
  infile datalines;
  input list $ NDC discount;
datalines;
Nov 11 50
Dec 12 20
Nov 11 50
May 12 .
Mar 11 30
;
run;

data mydata2;
  infile datalines;
  input list $ NDC discount;
datalines;
Jan 11 40
Dec 12 20
Mar 10 50
May 12 .
Mar 11 30
;
run;

proc sql;
  create table x1 as
  select *
  from mydata t1 left join mydata2 t2 on(t1.list=t2.list);
quit;

proc sort data=mydata;
  by list;
run;

proc sort data=mydata2;
  by list;
run;

data x2;
  merge mydata(in=t1) mydata2;
  by list;
  if t1;
run;

Results of x1:

 

x1.PNG

 

Results of x2:

 

x2.PNG

 

The second observation from x1 is different from that of x2. Why is this? It must be something extremely simple - forgive me if I'm wasting your time.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Because your variable names are the same in both data set.  With a MERGE the last value read is the one that you will see in the output. WIth SELECT * in SQL the first instance of that variable is the one that is output if there is another variable with the same name then it is ignored.  

 

Try this step and see if it doesn't match the SQL result.

data x3;
  merge mydata2 mydata(in=t1);
  by list;
  if t1;
run;

Since MYDATA is listed last the values of NDC and DISCOUNT from the MYDATA records will overwrite the values from the MYDATA2 records.

View solution in original post

8 REPLIES 8
JediApprentice
Pyrite | Level 9

In other words, why am I getting that observation from mydata2 when I specified "if t1".

Astounding
PROC Star

Since you are merging by LIST, IF T1 means "did this LIST value appear in T1?"  If it did, the results of the merge appear in the final data set.

 

One of the issues involved in getting different results is the meaning of SELECT * from MYDATA.  This brings in data values from MYDATA only, and brings in nothing from MYDATA2.  The only way MYDATA2 gets used is to compare its LIST values to those in MYDATA.  Which observation looks to you like it comes from MYDATA2?

JediApprentice
Pyrite | Level 9

@Astounding "One of the issues involved in getting different results is the meaning of SELECT * from MYDATA.  This brings in data values from MYDATA only, and brings in nothing from MYDATA2."

 

Then how come when I do this and select * in the proc sql at the end, all values from both datasets get brought in?

 

data mydata;
  infile datalines;
  input ID $ Country $ Code;
  datalines;
A USA 228
B BGM 210
C CAN 200
;
run;

data mydata2;
  infile datalines;
  input ID $ State $ age;
  datalines;
A WI 15
B CA 12
C MN 20
;
run;

proc sql;
  create table want as
  select *
  from have t1 left join have2 t2 on(t1.ID=t2.ID);
quit;

Is it because I'm now using different variables?

Tom
Super User Tom
Super User

Because your variable names are the same in both data set.  With a MERGE the last value read is the one that you will see in the output. WIth SELECT * in SQL the first instance of that variable is the one that is output if there is another variable with the same name then it is ignored.  

 

Try this step and see if it doesn't match the SQL result.

data x3;
  merge mydata2 mydata(in=t1);
  by list;
  if t1;
run;

Since MYDATA is listed last the values of NDC and DISCOUNT from the MYDATA records will overwrite the values from the MYDATA2 records.

Kurt_Bremser
Super User

The log will give you answers. Merging or joining datasets that contain the same variables always gives peculiar results (accompanied by log messages in the case of SQL when using select *) and is generally a BAD IDEA.

While SQL only takes variables from one table, the data step overwrites values in the sequence in which observations are read from the contributing tables.

JediApprentice
Pyrite | Level 9

@Kurt_Bremser

But SQL does take variables from both tables... I'm not sure I understand. If you have two tables - lets say have multiple variables but they share one variable i ncommon and you do a left join on that, you will get all the other variables from the datasets when you use a select *

Tom
Super User Tom
Super User

Your problem with SQL is that you cannot create a SAS dataset with two variables that have the same name.

Try this code to see what is happening.

proc sql ;
 select a.name,b.name
 from sashelp.class a
 inner join sashelp.class b
 on a.name = b.name
 ;
 create table test as
 select a.name,b.name
 from sashelp.class a
 inner join sashelp.class b
 on a.name = b.name
 ;
 select * from test ;
quit;

 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1525 views
  • 2 likes
  • 4 in conversation