Solved: Re: Good method for selecting observations based on min or max date

supp · Posted 09-13-2019 12:41 PM

Hello SAS community,

I have come across this scenario a few times and don't feel I have found or created a good solution. Lets say we have data as such:

data have;
	infile datalines;
		input personID $ attribute1 $ qty  date yymmdd10.;
		format date yymmdd10.;
datalines;
A m1 10 2019-03-01
A m1 20 2019-03-01
A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 10 2019-02-15
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15
D m1 20 2019-02-15
;
run;

We can have multiple observations per person per attribute. We want to return one observation per person per attribute with the earliest date, min(date). In the event there are multiple observations for the same person and attribute then select for the largest qty, max(qty).

Desired output:

A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15

For starters I tried this to return one row per person per attribute with the earliest date, but it doesn't work. As expected I get the note: NOTE: The query requires remerging summary statistics back with the original data.

proc sql;
	/** return one row per person and attribute **/
	create table want1 as
		select *
		from work.have
		group by personID, attribute1
		having min(date);
quit;

Is there a simple solution for this that I am not seeing? Is SQL the best approach? Or is there a better approach using a data step?

Thanks in advance!

edit: fixed first obs in desired output

PGStats · Posted 09-13-2019 12:55 PM

Use a subquery:

proc sql;
create table want as
select *
from (select * from have group by personId, attribute1 having date = min(date))
group by personId, attribute1, date
having qty = max(qty);
quit;

PG

View solution in original post

Reeza · Posted 09-13-2019 12:52 PM

Easy and boring with a data step.

Sort by person, date and descending quantity so the maxiumum quantity is first with multiple dates and then take the first record of each person/attribute group.

EDIT: Forgot attribute, added that in.

proc sort data=have;
by personID attribute date descending qt1;
run;

data want;
set have;
by personID attribute date;
if first.attribute;
run;

@supp wrote:

Hello SAS community,

I have come across this scenario a few times and don't feel I have found or created a good solution. Lets say we have data as such:
data have;
	infile datalines;
		input personID $ attribute1 $ qty  date yymmdd10.;
		format date yymmdd10.;
datalines;
A m1 10 2019-03-01
A m1 20 2019-03-01
A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 10 2019-02-15
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15
D m1 20 2019-02-15
;
run;
We can have multiple observations per person per attribute. We want to return one observation per person per attribute with the earliest date, min(date). In the event there are multiple observations for the same person and attribute then select for the largest qty, max(qty).

Desired output:
A m1 20 2019-03-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15
For starters I tried this to return one row per person per attribute with the earliest date, but it doesn't work. As expected I get the note: NOTE: The query requires remerging summary statistics back with the original data.
proc sql;
	/** return one row per person and attribute **/
	create table want1 as
		select *
		from work.have
		group by personID, attribute1
		having min(date);
quit;
Is there a simple solution for this that I am not seeing? Is SQL the best approach? Or is there a better approach using a data step?

Thanks in advance!

novinosrin · Posted 09-13-2019 12:54 PM

Hi @supp The way you frame questions is very intriguing and nice. This time, i am just gonna correct your SQL and wait for others to respond with their innovative answers. I am gonna see if I can test myself i.e my own capability to come up with something that others haven't. But mind you, if somebody among the likes of PGstats, Tom, Paul D, Reinhard, John King, Xia Keshan et al etc has answered. I am just gonna take that notes. 🙂 and be in awe.

You missed an equality operator to filter the group having identified the MIN

proc sql;
	/** return one row per person and attribute **/
	create table want1 as
		select *
		from work.have
		group by personID, attribute1
		having min(date)=date;
quit;

supp · Posted 09-13-2019 01:37 PM

Thanks for catching my syntax error. That actually helped a lot!

supp · Posted 09-13-2019 02:23 PM

Hi @supp The way you frame questions is very intriguing and nice.

@novinosrin , if you have any suggestions on how to improve asking these questions let me know!

novinosrin · Posted 09-13-2019 02:33 PM

You are doing great. We love it!

PGStats · Posted 09-13-2019 12:55 PM

Use a subquery:

proc sql;
create table want as
select *
from (select * from have group by personId, attribute1 having date = min(date))
group by personId, attribute1, date
having qty = max(qty);
quit;

PG

novinosrin · Posted 09-13-2019 01:22 PM

Does this work?

proc sql;
create table want as
 select *
	from work.have
     group by personID, attribute1
	having max(qty)=qty and min(date)=date;
quit;

PGStats · Posted 09-13-2019 01:27 PM

@novinosrin This can't always work because the max(qty) might not occur on the first date.

PG

novinosrin · Posted 09-13-2019 01:29 PM

Thank you Sir @PGStats Great catch

Reeza · Posted 09-13-2019 01:37 PM

It depends on the criteria. I read that if records with multiple minimum dates the first time I coded it, or it could be the max if multiple overall which does change the question a bit.

supp · Posted 09-13-2019 01:36 PM

@PGStats, you are correct. @novinosrin, I think the solution worked in the example I provided because all my scenarios had the max qty on the earliest date per person per attribute. I added person E to the data such that the max qty is not on the earliest date:

data have;
	infile datalines;
		input personID $ attribute1 $ qty  date yymmdd10.;
		format date yymmdd10.;
datalines;
A m1 10 2019-03-01
A m1 20 2019-03-01
A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 10 2019-02-15
B m3 20 2019-02-15
C m2 10 2019-01-15
D m1 20 2019-01-15
D m1 20 2019-02-15
E m2 30 2019-02-01
E m2 20 2019-01-01
;
run;


proc sql;
	/** return one row per person and attribute **/
	create table want1 as
		select *
		from work.have
		group by personID, attribute1
		having min(date)=date and qty=max(qty);
quit;

We don't return any observations for person E. Want to return the observation with the earliest date:

A	m1	20	2019-02-01
A	m2	20	2019-01-01
A	m3	20	2019-02-01
B	m3	20	2019-02-15
C	m2	10	2019-01-15
D	m1	20	2019-01-15

novinosrin · Posted 09-13-2019 02:12 PM

That's the reason i don't like PG because he is Prodigygeniusstats aka pgstats 🙂

FreelanceReinh · Posted 09-14-2019 07:28 AM

Hi @supp,

A third competitor besides PROC SQL (using GROUP BY, HAVING and nested subqueries) and the combination of PROC SORT and a DATA step (using BY and FIRST.by-variable or LAST.by-variable) is PROC SUMMARY (or its twin PROC MEANS):

proc summary data=have nway;
class personID attribute1;
output out=want(drop=_:) idgrp(min(date) max(qty) out (date qty)=);
run;

If the input dataset is sorted by the grouping variables, which is the case in your sample data, you can replace the CLASS statement with a BY statement and omit the NWAY option:

proc summary data=have;
by personID attribute1;
output out=want(drop=_:) idgrp(min(date) max(qty) out (date qty)=);
run;

I always have difficulties remembering the syntax of the IDGROUP (alias IDGRP) specification, but actually it's quite logical: Within the BY group (or combination of CLASS variable values) take the min(date) and, if there are ties, use max(qty) as a tiebreaker (and if there are still ties left, use the first of the tied observations in the input dataset -- or the last if the keyword LAST is inserted before OUT). The list of ID variables, (date qty), could include more variables if desired (e.g. all remaining variables of HAVE, if any).

Like the PROC SORT/DATA step approach PROC SUMMARY guarantees that only one observation per group will be included in the output dataset (see tiebreaking rules), whereas the HAVING clause of PROC SQL is prone to include tied observations (which may or may not be desirable, depending on the task).

It's also easily extendable to more than just two ID variables (date, qty). PROC SQL, however, would require additional levels of nested subqueries because the minimum/maximum specifications refer to different groupings and therefore require different GROUP BY clauses.

PROC SUMMARY shares with PROC SQL the advantages of doing everything in one step and not requiring sorted/indexed input data.

(I think the first date in the "desired output" in your initial post should read 2019-02-01.)

hashman · Posted 09-15-2019 01:13 PM

@supp:

From your description, it looks as though the date in the first output record should rather be 2019-02-01, as it precedes 2019-03-01 for the same ID "A" and attribute "m1".

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!