Pyrite | Level 9

## Good method for selecting observations based on min or max date

Hello SAS community,

I have come across this scenario a few times and don't feel I have found or created a good solution. Lets say we have data as such:

data have;
infile datalines;
input personID \$ attribute1 \$ qty  date yymmdd10.;
format date yymmdd10.;
datalines;
A m1 10 2019-03-01
A m1 20 2019-03-01
A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 10 2019-02-15
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15
D m1 20 2019-02-15
;
run;

We can have multiple observations per person per attribute. We want to return one observation per person per attribute with the earliest date, min(date). In the event there are multiple observations for the same person and attribute then select for the largest qty, max(qty).

Desired output:

A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15

For starters I tried this to return one row per person per attribute with the earliest date, but it doesn't work. As expected I get the note: NOTE: The query requires remerging summary statistics back with the original data.

proc sql;
/** return one row per person and attribute **/
create table want1 as
select *
from work.have
group by personID, attribute1
having min(date);
quit;

Is there a simple solution for this that I am not seeing? Is SQL the best approach? Or is there a better approach using a data step?

edit: fixed first obs in desired output

1 ACCEPTED SOLUTION

Accepted Solutions
Opal | Level 21

## Re: Good method for selecting observations based on min or max date

Use a subquery:

proc sql;
create table want as
select *
from (select * from have group by personId, attribute1 having date = min(date))
group by personId, attribute1, date
having qty = max(qty);
quit;
PG
18 REPLIES 18
Super User

## Re: Good method for selecting observations based on min or max date

Easy and boring with a data step.

Sort by person, date and descending quantity so the maxiumum quantity is first with multiple dates and then take the first record of each person/attribute group.

EDIT: Forgot attribute, added that in.

proc sort data=have;
by personID attribute date descending qt1;
run;

data want;
set have;
by personID attribute date;
if first.attribute;
run;

@supp wrote:

Hello SAS community,

I have come across this scenario a few times and don't feel I have found or created a good solution. Lets say we have data as such:

data have;
infile datalines;
input personID \$ attribute1 \$ qty  date yymmdd10.;
format date yymmdd10.;
datalines;
A m1 10 2019-03-01
A m1 20 2019-03-01
A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 10 2019-02-15
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15
D m1 20 2019-02-15
;
run;

We can have multiple observations per person per attribute. We want to return one observation per person per attribute with the earliest date, min(date). In the event there are multiple observations for the same person and attribute then select for the largest qty, max(qty).

Desired output:

A m1 20 2019-03-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15

For starters I tried this to return one row per person per attribute with the earliest date, but it doesn't work. As expected I get the note: NOTE: The query requires remerging summary statistics back with the original data.

proc sql;
/** return one row per person and attribute **/
create table want1 as
select *
from work.have
group by personID, attribute1
having min(date);
quit;

Is there a simple solution for this that I am not seeing? Is SQL the best approach? Or is there a better approach using a data step?

Tourmaline | Level 20

## Re: Good method for selecting observations based on min or max date

Hi @supp   The way you frame questions is very intriguing and nice. This time, i am just gonna correct your SQL and wait for others to respond with their innovative answers. I am gonna see if I can test myself i.e my own capability to come up with something that others haven't. But mind you, if somebody among the likes of PGstats, Tom, Paul D, Reinhard, John King, Xia Keshan et al etc has answered. I am just gonna take that notes. 🙂 and be in awe.

You missed an equality operator to filter the group having identified the MIN

proc sql;
/** return one row per person and attribute **/
create table want1 as
select *
from work.have
group by personID, attribute1
having min(date)=date;
quit;

Pyrite | Level 9

## Re: Good method for selecting observations based on min or max date

Thanks for catching my syntax error. That actually helped a lot!
Pyrite | Level 9

## Re: Good method for selecting observations based on min or max date

Hi @supp   The way you frame questions is very intriguing and nice.

@novinosrin , if you have any suggestions on how to improve asking these questions let me know!

Tourmaline | Level 20

## Re: Good method for selecting observations based on min or max date

You are doing great. We love it!

Opal | Level 21

## Re: Good method for selecting observations based on min or max date

Use a subquery:

proc sql;
create table want as
select *
from (select * from have group by personId, attribute1 having date = min(date))
group by personId, attribute1, date
having qty = max(qty);
quit;
PG
Tourmaline | Level 20

## Re: Good method for selecting observations based on min or max date

Does this work?

proc sql;
create table want as
select *
from work.have
group by personID, attribute1
having max(qty)=qty and min(date)=date;
quit;
Opal | Level 21

## Re: Good method for selecting observations based on min or max date

@novinosrin This can't always work because the max(qty) might not occur on the first date.

PG
Tourmaline | Level 20

## Re: Good method for selecting observations based on min or max date

Thank you Sir @PGStats  Great catch

Super User

## Re: Good method for selecting observations based on min or max date

It depends on the criteria. I read that if records with multiple minimum dates the first time I coded it, or it could be the max if multiple overall which does change the question a bit.
Pyrite | Level 9

## Re: Good method for selecting observations based on min or max date

@PGStats, you are correct. @novinosrin, I think the solution worked in the example I provided because all my scenarios had the max qty on the earliest date per person per attribute. I added person E to the data such that the max qty is not on the earliest date:

data have;
infile datalines;
input personID \$ attribute1 \$ qty  date yymmdd10.;
format date yymmdd10.;
datalines;
A m1 10 2019-03-01
A m1 20 2019-03-01
A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 10 2019-02-15
B m3 20 2019-02-15
C m2 10 2019-01-15
D m1 20 2019-01-15
D m1 20 2019-02-15
E m2 30 2019-02-01
E m2 20 2019-01-01
;
run;

proc sql;
/** return one row per person and attribute **/
create table want1 as
select *
from work.have
group by personID, attribute1
having min(date)=date and qty=max(qty);
quit;

We don't return any observations for person E. Want to return the observation with the earliest date:

A	m1	20	2019-02-01
A	m2	20	2019-01-01
A	m3	20	2019-02-01
B	m3	20	2019-02-15
C	m2	10	2019-01-15
D	m1	20	2019-01-15
Tourmaline | Level 20

## Re: Good method for selecting observations based on min or max date

That's the reason i don't like PG because he is Prodigygeniusstats aka pgstats 🙂

## Re: Good method for selecting observations based on min or max date

Hi @supp,

A third competitor besides PROC SQL (using GROUP BY, HAVING and nested subqueries) and the combination of PROC SORT and a DATA step (using BY and FIRST.by-variable or LAST.by-variable) is PROC SUMMARY (or its twin PROC MEANS):

proc summary data=have nway;
class personID attribute1;
output out=want(drop=_:) idgrp(min(date) max(qty) out (date qty)=);
run;

If the input dataset is sorted by the grouping variables, which is the case in your sample data, you can replace the CLASS statement with a BY statement and omit the NWAY option:

proc summary data=have;
by personID attribute1;
output out=want(drop=_:) idgrp(min(date) max(qty) out (date qty)=);
run;

I always have difficulties remembering the syntax of the IDGROUP (alias IDGRP) specification, but actually it's quite logical: Within the BY group (or combination of CLASS variable values) take the min(date) and, if there are ties, use max(qty) as a tiebreaker (and if there are still ties left, use the first of the tied observations in the input dataset -- or the last if the keyword LAST is inserted before OUT). The list of ID variables, (date qty), could include more variables if desired (e.g. all remaining variables of HAVE, if any).

Like the PROC SORT/DATA step approach PROC SUMMARY guarantees that only one observation per group will be included in the output dataset (see tiebreaking rules), whereas the HAVING clause of PROC SQL is prone to include tied observations (which may or may not be desirable, depending on the task).

It's also easily extendable to more than just two ID variables (date, qty). PROC SQL, however, would require additional levels of nested subqueries because the minimum/maximum specifications refer to different groupings and therefore require different GROUP BY clauses.

PROC SUMMARY shares with PROC SQL the advantages of doing everything in one step and not requiring sorted/indexed input data.

(I think the first date in the "desired output" in your initial post should read 2019-02-01.)

Ammonite | Level 13

## Re: Good method for selecting observations based on min or max date

From your description, it looks as though the date in the first output record should rather be 2019-02-01, as it precedes 2019-03-01 for the same ID "A" and attribute "m1".

Discussion stats
• 18 replies
• 3463 views
• 12 likes
• 6 in conversation