BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Callam1
Obsidian | Level 7
Some advice on the code below please. Is this the right code for a dataset with millions of records where I want to restrict to the most recent record (maximum time_flag) and retain all the columns associated with it for each group ordered and defined by 3 variable? I fear using proc sort would take ages.

* Take only the most recent record for each ID / var1 / var2 group;
PROC SQL;
create table spine as
select distinct
ID,
Var1,
Var2,
Var3,
Var4,
time_flag,
from mydataset as big
group by ID, var1, var2
having time_flag = max(time_flag)
order by ID, var1, var2;
quit;
1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Try this:

proc sort
  data=mydataset (
    keep=id var1 var2 var3 var4 time_flag
  )
  out=sorted
;
by id var1 var2 time_flag;
run;

data spine;
set sorted;
by id var1 var2;
if last.var2;
run;

SQL will also sort, and it is usually less efficient than PROC SORT.

View solution in original post

9 REPLIES 9
Kurt_Bremser
Super User

Try this:

proc sort
  data=mydataset (
    keep=id var1 var2 var3 var4 time_flag
  )
  out=sorted
;
by id var1 var2 time_flag;
run;

data spine;
set sorted;
by id var1 var2;
if last.var2;
run;

SQL will also sort, and it is usually less efficient than PROC SORT.

Callam1
Obsidian | Level 7
Thank you that is very simple. I didn’t use it as I thought it would have been much slower than the sql code. But you are suggesting the opposite. So a simple sort with sas is actually more efficient when using a big dataset?
Tom
Super User Tom
Super User

Did you try PROC SUMMARY?

Use your primary keys as the CLASS (or if already sorted BY) variables.

Use the IDGROUP and the MAX() type to select the observation with the maximum value of your time/ordering variable. And list the other variables you want copied from that maximum observation.

proc summary nway data=mydataset ;
  class id var1 var2 ;
  output out=spine idgroup (max(time_flag) out (var3 var4 time_flag)=);
run;

You might want to use the DROP= dataset option on the output dataset reference to remove the _TYPE_ _FREQ_ variables that PROC SUMMARY adds.

  output out=spine(drop=_type_ _freq_) idgroup (max(time_flag) out (var3 var4 time_flag)=);

Example:

data have;
  input id var1 var2 flag ;
cards;
1 1 1 1
1 1 2 2
1 1 3 3
1 2 1 1
1 3 1 1
;

proc print;
run;

proc summary data=have nway;
 class id var1 ;
 output out=want idgroup (max(var2) out (var2 flag)=) ;
run;

proc print;
run;

Tom_0-1750884457359.png

 

Callam1
Obsidian | Level 7
Thank you. Is the proc summary quicker than proc sort?
SASKiwi
PROC Star

What works faster will depend on your data and you can only prove which is faster by testing each method. Take a subset of your data and try it for yourself.

ballardw
Super User

@Callam1 wrote:
Thank you. Is the proc summary quicker than proc sort?

A likely issue with Proc Summary is getting the values for "other columns" as you said. Every variable would have to be referenced somewhere in the Proc Summary and might be a headache.

Tom
Super User Tom
Super User

@ballardw wrote:

@Callam1 wrote:
Thank you. Is the proc summary quicker than proc sort?

A likely issue with Proc Summary is getting the values for "other columns" as you said. Every variable would have to be referenced somewhere in the Proc Summary and might be a headache.


You can use _ALL_ if you don't mind the warning messages.

751
752  proc summary data=have nway;
753   class id var1 ;
754   output out=want idgroup (max(var2) out (_all_)=) ;
755  run;

WARNING: Variable id already exists on file WORK.WANT.
WARNING: Variable var1 already exists on file WORK.WANT.
WARNING: The duplicate variables will not be included in the output data set of the output statement number 1.
mkeintz
PROC Star

The PROC SUMMARY offers a neat compact single-pass solution.  It will certainly use a lot less disk input/output resources than the PROC SORT solution, and will probably be a lot faster - assuming there is no memory constraint.

 

BUT ... does your data have the possibility of tied maximum time_flag values for a given ID/VAR1/VAR2?

 

If not, then ignore the rest of this comment.  

 

But if it does, then the PROC SUMMARY might not likely give the same result as the PROC SORT ... if LAST.VAR2 solution.  It will choose different records (with possibly different VAR3/VAR4 values) among the tied records.

 

This is because the default behavior of PROC SORT is to preserve the original order (from the unsorted dataset) of tied records.  So that solution would always choose the latest of the tied records.

 

A quick test of PROC SUMMARY with ties suggests it would always choose the first of such ties.  At least it did so in the test below:

 

data ties;
  set sashelp.class (keep=name sex age weight);
  order='First'; output;
  order='Last' ; output;
run;
proc summary nway data=ties;
  class sex age;
  output out=summ_want (drop=_:) idgroup (max(weight) out (name weight order)=);
run;
proc sort data=ties;
  by sex age weight;
run;
data sort_want;
  set ties;
  by sex age;
  if last.age;
run;

Dataset summ_want has order='First' in every output record, but the PROC SORT approach always has ORDER='Last'.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 920 views
  • 7 likes
  • 6 in conversation