BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Excelsius
Obsidian | Level 7

I have a large dataset that has duplicate IDs and dates (DATE1 and DATE2). My goal is to extract all unique ID and DATE1 combinations and within them, select the MAX DATE2 value. I already have a simple code using PROC SUMMARY that works:

data T0;
	input ID $ SEL $ DATE1 :mmddyy10. DATE2 :mmddyy10.;
	format DATE1 mmddyy10. DATE2 mmddyy10.;
	datalines;
100 a . .
101 b 09/13/2020 09/13/2020
102 c 09/14/2020 09/14/2020
102 d 09/17/2020 09/17/2020
103 e 09/17/2020 09/18/2020
103 e 09/18/2020 09/19/2020
103 a 09/18/2020 09/20/2020
103 a 09/18/2020 09/21/2020
;
proc summary data=T0;
	class ID date1;
	types ID*date1;
	output out=T0_sum
		max(date2) = DATE2;
run;

OUTPUT:

Excelsius_0-1623543362947.png

I can also achieve the same with PROC SQL. However, I am trying to increase my understanding of SAS programming and so far have been unable to achieve the same result using FIRST, LAST, and perhaps DO loops. I understand how to create rows with a DO loop (such as the counter below), but I'm unclear as to how I can iterate over existing values:

data T_sum;
	set T0;
	by ID date1;
	if first.ID then counter = 0;
	counter +1;
	format min_date1 mmddyy10.;
run;

I think seeing a working example for this case can help me understand how SAS processes data in general. And I do understand that I will likely have to precede this code with PROC SORT BY all the variables, so it would be interesting if there is a solution that doesn't require sorting either (just like PROC SUMMARY and PROC SQL).

 

1 ACCEPTED SOLUTION

Accepted Solutions
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Excelsius 

 

A working example of a data step with input sorted on all variables is this:

 

data want; set have;
	by ID Date1;
	if last.Date1;
run;

The statement "if last.Date1" is a subsetting if. It means that output is written only for the last observation with a given Date1 value within a given ID, and this is also the observation with the highest Date2 value thanks to the sorting.

 

In most cases writing a SAS program is about getting the job done in the most simple and effecient way, and this is best achieved by using the available tools instead of inventing complex code. It is not possible to avoid a sort, it has to be done somewhere, either in a Proc Sort preceeding a Data Step or  "behind the scenes" in a Proc Something, and the sort + data step is in most cases also the most effecient solution. I the old days we joked about SAS meaning Sort After Sort.

 

It would be possible to avoid sorting on Date2 by stacking Date2-values in an array and loop over the array when the last Date1 is met, or use a 2-dimensional array and avoid sorting on Date1 too, or even use a 3-dimensional array and avoid sorting altogether, but it would be difficult to dimension the arrays to handle unknown input, and it would need a lot of coding and certainly not be as effecient as the preceeding sort.

 

I suggest that you use sort + data step to get the job done, and then google array processing in SAS to get some sample code to use as a starter for learning the techniques. It is fun to work with, so good luck.

View solution in original post

13 REPLIES 13
ChrisNZ
Tourmaline | Level 20

If your data is sorted, this would be the most efficient way too, as you just read the data sequentially once.

Something like:

data T_sum;
 retain START_DTE;
 set T0;
 by ID ;
 if first.ID then START_DTE=DATE;
 if last.ID then do;
  END_DTE=DATE;
  output; 
 end;
run;

 

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Excelsius 

 

A working example of a data step with input sorted on all variables is this:

 

data want; set have;
	by ID Date1;
	if last.Date1;
run;

The statement "if last.Date1" is a subsetting if. It means that output is written only for the last observation with a given Date1 value within a given ID, and this is also the observation with the highest Date2 value thanks to the sorting.

 

In most cases writing a SAS program is about getting the job done in the most simple and effecient way, and this is best achieved by using the available tools instead of inventing complex code. It is not possible to avoid a sort, it has to be done somewhere, either in a Proc Sort preceeding a Data Step or  "behind the scenes" in a Proc Something, and the sort + data step is in most cases also the most effecient solution. I the old days we joked about SAS meaning Sort After Sort.

 

It would be possible to avoid sorting on Date2 by stacking Date2-values in an array and loop over the array when the last Date1 is met, or use a 2-dimensional array and avoid sorting on Date1 too, or even use a 3-dimensional array and avoid sorting altogether, but it would be difficult to dimension the arrays to handle unknown input, and it would need a lot of coding and certainly not be as effecient as the preceeding sort.

 

I suggest that you use sort + data step to get the job done, and then google array processing in SAS to get some sample code to use as a starter for learning the techniques. It is fun to work with, so good luck.

Excelsius
Obsidian | Level 7

Thank you @ErikLund_Jensen. This is a very elegant and minimalistic solution. I really like it and am curious how it will perform in terms of speed. I also saw lrackley's variation that includes "output" specification:

data max;
	set t0sorted;
	by ID date1 date2;
	if last.date1 then output;
run;

But it seems like specifying output is not necessary here. Would there be a case where it would be required/good practice to specify it in the code?

tarheel13
Rhodochrosite | Level 12

It's not going to hurt anything putting the output in there. I also believe it's better to include date2 in the sort because you when you're outputting the last by group, you want the dates to be sorted in ascending order so you only get the last row, which would be the maximum date. But if you sorted in descending order, then you could do if first.date1. Anyway, there is nothing wrong with using proc sql or proc summary if they make more sense to you and get the right answer. I would go with what you trust.

Excelsius
Obsidian | Level 7

Thank you everyone for some great examples. I learned more in this thread than during the considerable time I spent browsing and checking book references.

In case anyone is curious, here is the performance I observed:

PROC SUMMARY and PROC SQL performed pretty much on par, with SQL few seconds slower (out of about 26min processing time). The DATA step with LAST method performed the slowest, at about 1.5 minutes slower (not huge difference of course). In trying to find out why, I noticed that all the time is sucked up by the required PROC SORT. One of the things I do in my data step is to filter it down to only 1% of the entire large dataset and only then apply the code discussed here. When I broke down PROC SORT into separate subseting and sorting steps, I saw that it's in fact the subseting itself that takes the most time--PROC SORT was very quick after subsetting. I guess PROC SUMMARY and PROC SQL are a bit faster in this case because perhaps they do the analyses on the fly by filtering (WHERE statement), maybe saving I/O operations. I do find it interesting though that PROC SORT alone with the same WHERE statement took longer to process than the entire PROC SUMMARY or PROC SQL program, which also included the same WHERE filtering plus the analyses.

ChrisNZ
Tourmaline | Level 20

> I do find it interesting though that PROC SORT alone with the same WHERE statement took longer to process than the entire PROC SUMMARY or PROC SQL program

Probably for the same reason you highlighted: less I/O as proc sort rewrites the whole table whereas the others only write the summary.

 

Note too that proc sql can use ordering methods other than sorting. This is visible if you add the _method option to proc sql.

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Excelsius 

 

It would be interesting to make a similar speed test in my environment, so may I ask how many rows you have in your real data, and the number/lengths of variables besides the three used in the subsetting? 

Excelsius
Obsidian | Level 7

Hi @ErikLund_Jensen,

The initial dataset contains just a little under 1.4 billion records with 59 variables. 50 are character with an average length of 7 and 9 are numeric with average length of 8. The filtering is based on a single variable unrelated to the ones used in the code discussed here. The dataset is about 400GB. However, after filtering, the dataset is reduced to just a little under 2.6% of the total observations (~36 million) and 4 variables. To this subset the codes we discussed here would be applied to reduce it further to ~13 million in the final dataset. Of course, this all happens in a single step in PROC SUMMARY and PROC SQL. I hope that makes sense.

I would also be curious to see what you find. I think the I/O operations speed is going to be important here.

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Excelsius 

I think you will find that it performs extremely well. In most cases a SAS Sort + data step performs better than doing the same in a Proc SQL.

 

And about the output statement - actually I prefer the long form with output, because IF normally is followed by THEN, so it looks like something is missing when the short form is used. But I used it to make the code as minimalistic as possible. 

sonofendor
Fluorite | Level 6

Hi, Excelsius!

 

So at first, you need to sort your data in order of the variables that are listed in the BY statement otherwise it won't work out (similar to MERGE BY). In my code I changed the order of rows in  the example dataset so if you remove PROC SORT the second DATA STEP will generate an error.

 

Now to how FIRST and LAST work. I think it would be easier to demonstrate with following example:

data outt;
	set T0;
	by id date1;
	first_ID = first.ID;
	last_ID = last.ID;
	first_date = first.date1;
	last_date = last.date1;
run;

sonofendor_0-1623574941651.png

first.ID and last.ID mark the beginning and the end of each group for ID in whole dataset

first.DATE1 and last.DATE1 mark the beginning and the end of each group for DATE1 inside each group for ID 

So to find the start or the end of any (ID, DATE1) group inside the dataset you should look only at FIRST and LAST for DATE1. Now to finding the max value of DATE2.

 

So how do you find max value in a sequence? You can create a new variable, assign to it the minimal possible value (in our case in can be some very old date like '01JAN1900'd however missing  value also works), read every value and if it's greater than max then make it new max. The problem is that SAS sets all variables to missing when reading new observation from input dataset. You can avoid this by using REATIN statement: it tells SAS not to set this variable to missing when reading new observation. Now since you want to find max value not in the whole dataset but in every group, manually set your variable to missing when reading first observation in a group. And output when you encounter the last observation in a group. 

 

data T0;
	input ID $ SEL $ DATE1 :mmddyy10. DATE2 :mmddyy10.;
	format DATE1 mmddyy10. DATE2 mmddyy10.;
	datalines;
100 a . .
102 d 09/17/2020 09/17/2020
103 e 09/17/2020 09/18/2020
103 e 09/18/2020 09/19/2020
102 c 09/14/2020 09/14/2020
103 a 09/18/2020 09/20/2020
103 a 09/18/2020 09/21/2020
101 b 09/13/2020 09/13/2020
;
run;

proc sort data=T0;
by id date1;
run;

data out;
	set T0;
	by id date1;
	
	/* create new variable for max value */
	length max_date 8;
	format max_date mmddyy10.;
	
	/* sas sets all variables to missing when reading new observation
		to avoid this use retain */
	retain max_date;
	
	/* set max_date to null only when you are reading first observation in a group */
	if first.date1 then call missing (of max_date);
	
	/* looking for max value */
	if date2 > max_date then max_date = date2;
	
	/* for a last observation in a group assign max value to date2 and output */
	if last.date1 then do;
		date2 = max_date;
		output;
	end;
	
	/* keep only the fields you need */
	keep id date1 date2;
run;

sonofendor_1-1623576587443.png

Hope this will help you 🙂

P.S. Here is a good paper about BY statement

Excelsius
Obsidian | Level 7

Thanks for this great explanation @sonofendor. I especially like the sample code that creates explicit flags for FIRST and LAST. What a great way to experiment with and directly see what is SAS doing in the background. I think I will keep using this method to first see the flags myself and only then apply the filtering.

tarheel13
Rhodochrosite | Level 12

I don't think it's necessary to use do loop or retain statement. Why don't you just sort them all in ascending order by ID, date1, date2? Then to get the maximum you just do if last.date1 then output. 

proc sql number;
	select ID, date1, max(date2) as max_date format= mmddyy10. label='Max date2'
		from t0
		group by ID, date1;
quit;

proc sort data=t0 out=t0sorted;
	by ID date1 date2;
run;

data max;
	set t0sorted;
	by ID date1 date2;
	if last.date1 then output;
run;

title "Using last.date2 to get max";
proc print data=max;
run;
title;
tarheel13
Rhodochrosite | Level 12

https://www.listendata.com/2015/03/3-ways-to-find-maximum-value-in-group.html

 

check this website as they have more examples that you could play with. 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 4901 views
  • 11 likes
  • 5 in conversation