Hi, guys
Recently, I've met a strange problem. I can't understand that. Hopefully, you guys can explain to me. Here is it:
Let's see an basic result:
/* 2 variables and 1 observation in table1 */
data table1;
id="001";
num=uniform(1);
run;
/* 2 variables and 5 observation in table2*/
data table2;
input id $ score;
cards;
001 80
001 80
001 90
001 60
001 85
;
run;
/* Inner Join table1 and table2*/
proc sql;
create table result as
select a.*,b.score
from table1 a inner join table2 b
on a.id=b.id;
quit;
proc print;run;
The Data Set Result is:
Notice the order of variable SCORE, is just same as the original order in table2.
But when the number of variables in table1 increases, magical thing happened. Let's see:
/* 12 variables and 1 observation in table1 */
data table1;
id="001";
array num{10} ;
do i=1 to 10;
num{i}=uniform(i)+i;
end;
run;
/* 2 variables and 5 observation in table2 */
data table2;
input id $ score;
cards;
001 80
001 80
001 90
001 60
001 85
;
run;
/* Inner Join table1 and table2*/
proc sql;
create table result as
select a.*,b.score
from table1 a inner join table2 b
on a.id=b.id;
quit;
proc print;run;
The order of variable SCORE in Data Set RESULT has changed:
So, I'm confused. Why did it change?
Next, I run a test, to see the relation between the number of the variables in data set table1 and the order of varialbe SCORE in data set RESULT. Here is the code:
%macro create_table1(count);
%local i;
data table1;
id="001";
%do i=1 %to &count;
num&i=uniform(&i);
%end;
run;
%mend create_table1;
%macro test_join;
proc sql;
drop table differ_all;
quit;
%local i;
%do i=1 %to 20;
%create_table1(&i)
proc sql;
create table result&i as
select a.*,b.score
from table1 a inner join table2 b
on a.id=b.id;
quit;
data differ&i;
merge result(keep=score rename=(score=base)) result&i(keep=score rename=(score=new)) end=last;
length score_base score_new $ 20 differ_flag $ 5;
retain score_base score_new;
score_base=catx(',',score_base,base);
score_new=catx(',',score_new,new);
if last then do;
var_count=&i+1;/*plus the variable ID*/
differ_flag=ifc(score_base=score_new,'true','false');
output;
end;
keep var_count score_base score_new differ_flag;
run;
proc append base=differ_all data=differ&i force;
run;
%end;
proc print data=differ_all;run;
%mend test_join;
data table1;
id="001";
num=uniform(1);
run;
data table2;
input id $ score;
cards;
001 80
001 80
001 90
001 60
001 85
;
run;
proc sql;
create table result as
select a.*,b.score
from table1 a inner join table2 b
on a.id=b.id;
quit;
%test_join
And the result is:
When the number of variables of data set table1 equals 11, then order of SCORE in join table has changed. I test a few times, it's the same.
So, anyone can tell me the reason? And I can't find anything about that in SAS Help and Documents.
TL;DR;
Anwyays from your top bit, SQL is intrinsically an unsorted procedure - i.e. data does not need to be sorted when passed into it (unlike SAS procedures). When the SQL parser runs it creates sort blocks, merge blocks etc. in what is called the SQL plan. This is bigger than the code you write. If you don't explicitly state certian things, such as output order, then it is left to the SQL parser to decide on what to do. You can see what methods are chose using the _method option and see the whole process using _tree:
proc sql _method _tree; ... quit;
for your first question
The order of variable SCORE in Data Set RESULT has changed:
Order is not preserved in a sql join. So results are pretty as much expected.
Thx.
TL;DR;
Anwyays from your top bit, SQL is intrinsically an unsorted procedure - i.e. data does not need to be sorted when passed into it (unlike SAS procedures). When the SQL parser runs it creates sort blocks, merge blocks etc. in what is called the SQL plan. This is bigger than the code you write. If you don't explicitly state certian things, such as output order, then it is left to the SQL parser to decide on what to do. You can see what methods are chose using the _method option and see the whole process using _tree:
proc sql _method _tree; ... quit;
Thanks, that will help!
For added fun examine LEFT JOIN
proc sql; create table result2 as select a.*,b.score from table1 a left join table2 b on a.id=b.id; quit;
and compare the results with inner join.
Basically if you care about the resulting order include an ORDER BY clause to get what you want.
Thx, the order by clause is needed.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.