proc sort data=have1;by var1;run;
data have2 (rename =(var1er=var1));set have2; run;
proc sort data=have2;by var1;run;
data want;
merge have1(in=a) have2(in=b);
by var1;
if a;
run;
Is it possible to simplify this?
No extra step for the rename is needed:
proc sort data=have1;
by var1;
run;
proc sort data=have2;
by var1er;
run;
data want;
merge
have1 (in=a)
have2 (
in=b
rename=(var1er=var1)
)
;
by var1;
if a;
run;
@afiqcjohari wrote:
Is it possible to take out the sorting part? Could it be done within the data step merge?
To merge, you always have to sort. Even if you use proc sql and you do not have to sort explicitly, the sql procedure will do it automatically, which basically uses the same resources and needs the same time. If you need the same order repeatedly, an explicit sort is the way to go.
@Kurt_Bremser wrote:
@afiqcjohari wrote:
Is it possible to take out the sorting part? Could it be done within the data step merge?To merge, you always have to sort. Even if you use proc sql and you do not have to sort explicitly, the sql procedure will do it automatically, which basically uses the same resources and needs the same time. If you need the same order repeatedly, an explicit sort is the way to go.
The initial code was in proc sql, but I prefer data step merge because it can create multiple tables. Had it been in proc sql, I need to write separate proc sql for each different table. But I really want to avoid writing the 2 proc sorts, if possible. But it seems that it's not. What I don't get is, why if data merge necessitates the 2 tables to be sorted by the common variable, shouldn't data merge internally 'know' that the 2 tables will need to be sorted? Meaning, data merge should by default sort these 2 variables implicitly.
Is there a case where data merge could be called without the tables being sorted in the first place?
The merge statement in a data step is very simple and relies on the given order of the datasets. This is its natural behaviour. Therefore the sort is necessary if you use by with a merge.
This simplicity is one of the reasons why sort & merge often outperforms SQL (which sorts implicitly).
A proc sort is usually three lines of code. I type that without really thinking in < 10 seconds. And I'm no touch-typist. To make your work easier, you can define a keyboard macro in EG that writes the sort step for you, so you only have to fill in the dataset and the by variables.
Something like:
proc sql; create table WANT as select A.VAR1ER as VAR1, B.* from HAVE1 A left join HAVE2 B on A.VAR1=B.VAR1; quit;
Well, SQL was developed to handle relational databases, so combining data is the key functioanlity of it. True there are times when either can be better. As far as this question, we only see a tiny part of the problem apparently, so can only guess.
I really appreciate your answer. What I don't understand is why does SAS require the programmer to explicitly write the 2 proc sorts on the 2 tables? Why can't data merge does it at one go? Because as per my understanding, data merge requires the tables to be sorted. It could have been more concise if data merge can implicity sort the tables by itself.
@afiqcjohari wrote:
I really appreciate your answer. What I don't understand is why does SAS require the programmer to explicitly write the 2 proc sorts on the 2 tables? Why can't data merge does it at one go? Because as per my understanding, data merge requires the tables to be sorted. It could have been more concise if data merge can implicity sort the tables by itself.
Misconception. Merge on its own does NOT require the datasets to be sorted. Merge with a by does need sorting.
Perhaps an important point .... MERGE with a BY statement requires that your data set is already in order before the DATA step begins. If it is already in order without running PROC SORT, that is sufficient. You only need to run PROC SORT when the data is not in the right order.
Also the BY statement in a data step is used with other things than just MERGE, UPDATE or MODIFY statements in a data step.
And the BY statement supports the NOTSORTED and DESCENDING options and trying to "guess" all the time which combination a user "intended" might be very problematical.
And since a basic sort statement is 3 lines of code including the "run;" I am surprised this is much of an issue at all.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.