Re: joins different results in Datastep vs Proc sql

pavank · Posted 06-05-2024 04:53 AM

Hi Experts,

Goodmorning,

Here i get different output for left join with Datastep vs Proc sql how works datastep joins and proc sql joins please explain

data EXP1;
Input ID Name$ Marks $ ;
cards;
101 ravi 70%
102 sanjay 80% 
103 suraj 88%
105 kirit 75%
104 anshu 87%
;
run;

data EXP2;
Input ID Name$ Sex $;
cards;
101 ravi M
102 sanjay M
103 suraj M
104 monit M
105 mukesh M
;
run;

proc sort data=exp1;
by id;
run;

proc sort data=exp2;
by id;
run;

title 'LEFT JOIN IN DATASTEP';
data leftjoin;
merge exp1 (in=a) exp2 (in=b);
by id;
if a;
proc print ;
run;

title 'LEFT JOIN IN PROC SQL';
proc sql number;
select * from exp1 a
left join exp2 b
on a.id=b.id;
quit;

Kurt_Bremser · Posted 06-05-2024 05:32 AM

This is the result of your careless, sloppy programming in PROC SQL which also leads to a WARNING (Maxim 2: Read the Log!).

Since the variable name is present in both datasets, SQL will issue a WARNING and take only the first appearance, which is from dataset exp1. You avoid this by using an exhaustive list of variables. The usage of the asterisk is strongly discouraged when doing joins; it may be used for the variables of one particular dataset (e.g. exp1.*)

In the DATA step MERGE (which, BTW, should never be confused with a join, as it has different mechanics), the variable appears once in the PDV. It is populated with the value from exp1 when the observation from there is read, and then overwritten with the value from exp2. If you had more observations for a particular BY group in exp1 than in exp2, some values from exp1 would make it into the result.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mariangela86 · Posted 06-05-2024 06:35 AM

@pavank

A data step MERGE differs from a PROC SQL in:

(1) The sort order of the key variables When using a SQL join the observations in either
data set do not have to be sorted, a DATA step MERGE requires the key variables (BY variables) to
have been sorted in a step before
(2) Cartesian products All joins are

Cartesian products made on specified key variables.
In left join , where there are multiple key values, a complete Cartesian product is created.

With merge, the input observations are joined in sequence where they match.

3) Alias in proc sql you can use alias name for table to to identify variables non-key, in DATA Step MERGE can be used with a logical IN variable to identify which data set contributed the key
values, but the variable names non key, must be unique to each data set since their source data set cannot be identified with an alias. (in your example if you try to generate a table with left join, without rename variables, you take a warning message for Variable Name and ID too)

proc sql number;
create table leftjoin2 as 
select * from exp1 a
left join exp2 b
on a.id=b.id;
quit;

Spoiler

WARNING: Variable ID already exists on file WORK.LEFTJOIN2.
WARNING: Variable Name already exists on file WORK.LEFTJOIN2.

WARNING: Variable ID already exists on file WORK.LEFTJOIN2.WARNING: Variable Name already exists on file WORK.LEFTJOIN2.

proc sql number;
create table leftjoin2 as 
select a.*,b.Sex 
from exp1 a
left join exp2 b
on a.id=b.id;
quit;

4) Runtime PROC SQL join is faster than a DATA Step MERGE; you can use indexex in both
to improve performance

I recommend you read this article!

https://support.sas.com/resources/papers/proceedings/proceedings/sugi25/25/cc/25p109.pdf

The paper explain the principles difference between proc sql and merge.

ballardw · Posted 06-05-2024 09:56 AM

Repeat after me:

A data step MERGE is not a JOIN, any SQL JOIN is not a data step merge.

MERGE does things with non-by variable same named variables that NONE of the joins do without adding explicit code.

Merge require the data sets to be sorted by the BY variable. OR use the NOTSORTED option which can very interesting results when the by values are not grouped together.

A very important part is what happens when duplicates of a BY variable appear in merge. Run this code and then take a look at which values are kept.

data one;
   input x y;
datalines;
1 1
1 2
1 3
2 11
2 22
3 333
;

data two;
   input x z;
datalines;
1 11
1 22
2 111
2 222
2 333
3 3333
3 4444
;

data merged;
   merge one two;
   by x;
run;

In a very small example sometimes the results can look like a join. But don't count on it. They are different.

joins different results in Datastep vs Proc sql