BookmarkSubscribeRSS Feed
afiqcjohari
Quartz | Level 8
proc sort data=have1;by var1;run;
data have2 (rename =(var1er=var1));set have2; run;
proc sort data=have2;by var1;run;

data want;
merge have1(in=a) have2(in=b);
by var1; if a; run;


Is it possible to simplify this?

21 REPLIES 21
Jagadishkatam
Amethyst | Level 16
Not sure, in the code the statement 'if a' is doing anything.

we could remove the same as there is no by statement.
Thanks,
Jag
afiqcjohari
Quartz | Level 8
Edited for the by statement.
Kurt_Bremser
Super User

No extra step for the rename is needed:

proc sort data=have1;
by var1;
run;

proc sort data=have2;
by var1er;
run;

data want;
merge
  have1 (in=a)
  have2 (
    in=b
    rename=(var1er=var1)
  )
;
by var1;
if a;
run;
afiqcjohari
Quartz | Level 8
Is it possible to take out the sorting part? Could it be done within the data step merge?
Kurt_Bremser
Super User

@afiqcjohari wrote:
Is it possible to take out the sorting part? Could it be done within the data step merge?

To merge, you always have to sort. Even if you use proc sql and you do not have to sort explicitly, the sql procedure will do it automatically, which basically uses the same resources and needs the same time. If you need the same order repeatedly, an explicit sort is the way to go.

afiqcjohari
Quartz | Level 8

@Kurt_Bremser wrote:

@afiqcjohari wrote:
Is it possible to take out the sorting part? Could it be done within the data step merge?

To merge, you always have to sort. Even if you use proc sql and you do not have to sort explicitly, the sql procedure will do it automatically, which basically uses the same resources and needs the same time. If you need the same order repeatedly, an explicit sort is the way to go.


The initial code was in proc sql, but I prefer data step merge because it can create multiple tables. Had it been in proc sql, I need to write separate proc sql for each different table. But I really want to avoid writing the 2 proc sorts, if possible. But it seems that it's not. What I don't get is, why if data merge necessitates the 2 tables to be sorted by the common variable, shouldn't data merge internally 'know' that the 2 tables will need to be sorted? Meaning, data merge should by default sort these 2 variables implicitly.

 

Is there a case where data merge could be called without the tables being sorted in the first place?

Kurt_Bremser
Super User

The merge statement in a data step is very simple and relies on the given order of the datasets. This is its natural behaviour. Therefore the sort is necessary if you use by with a merge.

This simplicity is one of the reasons why sort & merge often outperforms SQL (which sorts implicitly).

 

A proc sort is usually three lines of code. I type that without really thinking in < 10 seconds. And I'm no touch-typist. To make your work easier, you can define a keyboard macro in EG that writes the sort step for you, so you only have to fill in the dataset and the by variables.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Something like:

proc sql;
  create table WANT as
  select A.VAR1ER as VAR1,
           B.*
  from   HAVE1 A
  left join HAVE2 B
  on      A.VAR1=B.VAR1;
quit;
afiqcjohari
Quartz | Level 8
Thank you. But one of the feature I like about SAS is the data merge step. It's actually pretty elegant minus the need to sort step. :). Actually the real code outputs to many tables which data merge is very good at. Writing in proc sql would mean 2 different proc sql for 2 different tables.
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, SQL was developed to handle relational databases, so combining data is the key functioanlity of it.  True there are times when either can be better.  As far as this question, we only see a tiny part of the problem apparently, so can only guess.

afiqcjohari
Quartz | Level 8

I really appreciate your answer. What I don't understand is why does SAS require the programmer to explicitly write the 2 proc sorts on the 2 tables? Why can't data merge does it at one go? Because as per my understanding, data merge requires the tables to be sorted. It could have been more concise if data merge can implicity sort the tables by itself.

Kurt_Bremser
Super User

@afiqcjohari wrote:

I really appreciate your answer. What I don't understand is why does SAS require the programmer to explicitly write the 2 proc sorts on the 2 tables? Why can't data merge does it at one go? Because as per my understanding, data merge requires the tables to be sorted. It could have been more concise if data merge can implicity sort the tables by itself.


Misconception. Merge on its own does NOT require the datasets to be sorted. Merge with a by does need sorting.

Astounding
PROC Star

Perhaps an important point .... MERGE with a BY statement requires that your data set is already in order before the DATA step begins.  If it is already in order without running PROC SORT, that is sufficient.  You only need to run PROC SORT when the data is not in the right order.

ballardw
Super User

Also the BY statement in a data step is used with other things than just MERGE, UPDATE or MODIFY statements in a data step.

And the BY statement supports the NOTSORTED and DESCENDING options and trying to "guess" all the time which combination a user "intended" might be very problematical.

 

And since a basic sort statement is 3 lines of code including the "run;" I am surprised this is much of an issue at all.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 21 replies
  • 2237 views
  • 5 likes
  • 7 in conversation