Hello SAS community,
I have come across this scenario a few times and don't feel I have found or created a good solution. Lets say we have data as such:
data have;
infile datalines;
input personID $ attribute1 $ qty date yymmdd10.;
format date yymmdd10.;
datalines;
A m1 10 2019-03-01
A m1 20 2019-03-01
A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 10 2019-02-15
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15
D m1 20 2019-02-15
;
run;
We can have multiple observations per person per attribute. We want to return one observation per person per attribute with the earliest date, min(date). In the event there are multiple observations for the same person and attribute then select for the largest qty, max(qty).
Desired output:
A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15
For starters I tried this to return one row per person per attribute with the earliest date, but it doesn't work. As expected I get the note: NOTE: The query requires remerging summary statistics back with the original data.
proc sql;
/** return one row per person and attribute **/
create table want1 as
select *
from work.have
group by personID, attribute1
having min(date);
quit;
Is there a simple solution for this that I am not seeing? Is SQL the best approach? Or is there a better approach using a data step?
Thanks in advance!
edit: fixed first obs in desired output
Use a subquery:
proc sql;
create table want as
select *
from (select * from have group by personId, attribute1 having date = min(date))
group by personId, attribute1, date
having qty = max(qty);
quit;
Easy and boring with a data step.
Sort by person, date and descending quantity so the maxiumum quantity is first with multiple dates and then take the first record of each person/attribute group.
EDIT: Forgot attribute, added that in.
proc sort data=have;
by personID attribute date descending qt1;
run;
data want;
set have;
by personID attribute date;
if first.attribute;
run;
@supp wrote:
Hello SAS community,
I have come across this scenario a few times and don't feel I have found or created a good solution. Lets say we have data as such:
data have; infile datalines; input personID $ attribute1 $ qty date yymmdd10.; format date yymmdd10.; datalines; A m1 10 2019-03-01 A m1 20 2019-03-01 A m1 20 2019-02-01 A m2 20 2019-01-01 A m3 20 2019-02-01 B m3 10 2019-02-15 B m3 20 2019-02-15 C m2 20 2019-01-15 D m1 20 2019-01-15 D m1 20 2019-02-15 ; run;
We can have multiple observations per person per attribute. We want to return one observation per person per attribute with the earliest date, min(date). In the event there are multiple observations for the same person and attribute then select for the largest qty, max(qty).
Desired output:
A m1 20 2019-03-01 A m2 20 2019-01-01 A m3 20 2019-02-01 B m3 20 2019-02-15 C m2 20 2019-01-15 D m1 20 2019-01-15
For starters I tried this to return one row per person per attribute with the earliest date, but it doesn't work. As expected I get the note: NOTE: The query requires remerging summary statistics back with the original data.
proc sql; /** return one row per person and attribute **/ create table want1 as select * from work.have group by personID, attribute1 having min(date); quit;
Is there a simple solution for this that I am not seeing? Is SQL the best approach? Or is there a better approach using a data step?
Thanks in advance!
Hi @supp The way you frame questions is very intriguing and nice. This time, i am just gonna correct your SQL and wait for others to respond with their innovative answers. I am gonna see if I can test myself i.e my own capability to come up with something that others haven't. But mind you, if somebody among the likes of PGstats, Tom, Paul D, Reinhard, John King, Xia Keshan et al etc has answered. I am just gonna take that notes. 🙂 and be in awe.
You missed an equality operator to filter the group having identified the MIN
proc sql;
/** return one row per person and attribute **/
create table want1 as
select *
from work.have
group by personID, attribute1
having min(date)=date;
quit;
Hi @supp The way you frame questions is very intriguing and nice.
@novinosrin , if you have any suggestions on how to improve asking these questions let me know!
You are doing great. We love it!
Use a subquery:
proc sql;
create table want as
select *
from (select * from have group by personId, attribute1 having date = min(date))
group by personId, attribute1, date
having qty = max(qty);
quit;
Does this work?
proc sql;
create table want as
select *
from work.have
group by personID, attribute1
having max(qty)=qty and min(date)=date;
quit;
@novinosrin This can't always work because the max(qty) might not occur on the first date.
Thank you Sir @PGStats Great catch
@PGStats, you are correct. @novinosrin, I think the solution worked in the example I provided because all my scenarios had the max qty on the earliest date per person per attribute. I added person E to the data such that the max qty is not on the earliest date:
data have;
infile datalines;
input personID $ attribute1 $ qty date yymmdd10.;
format date yymmdd10.;
datalines;
A m1 10 2019-03-01
A m1 20 2019-03-01
A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 10 2019-02-15
B m3 20 2019-02-15
C m2 10 2019-01-15
D m1 20 2019-01-15
D m1 20 2019-02-15
E m2 30 2019-02-01
E m2 20 2019-01-01
;
run;
proc sql;
/** return one row per person and attribute **/
create table want1 as
select *
from work.have
group by personID, attribute1
having min(date)=date and qty=max(qty);
quit;
We don't return any observations for person E. Want to return the observation with the earliest date:
A m1 20 2019-02-01 A m2 20 2019-01-01 A m3 20 2019-02-01 B m3 20 2019-02-15 C m2 10 2019-01-15 D m1 20 2019-01-15
That's the reason i don't like PG because he is Prodigygeniusstats aka pgstats 🙂
Hi @supp,
A third competitor besides PROC SQL (using GROUP BY, HAVING and nested subqueries) and the combination of PROC SORT and a DATA step (using BY and FIRST.by-variable or LAST.by-variable) is PROC SUMMARY (or its twin PROC MEANS):
proc summary data=have nway;
class personID attribute1;
output out=want(drop=_:) idgrp(min(date) max(qty) out (date qty)=);
run;
If the input dataset is sorted by the grouping variables, which is the case in your sample data, you can replace the CLASS statement with a BY statement and omit the NWAY option:
proc summary data=have;
by personID attribute1;
output out=want(drop=_:) idgrp(min(date) max(qty) out (date qty)=);
run;
I always have difficulties remembering the syntax of the IDGROUP (alias IDGRP) specification, but actually it's quite logical: Within the BY group (or combination of CLASS variable values) take the min(date) and, if there are ties, use max(qty) as a tiebreaker (and if there are still ties left, use the first of the tied observations in the input dataset -- or the last if the keyword LAST is inserted before OUT). The list of ID variables, (date qty), could include more variables if desired (e.g. all remaining variables of HAVE, if any).
Like the PROC SORT/DATA step approach PROC SUMMARY guarantees that only one observation per group will be included in the output dataset (see tiebreaking rules), whereas the HAVING clause of PROC SQL is prone to include tied observations (which may or may not be desirable, depending on the task).
It's also easily extendable to more than just two ID variables (date, qty). PROC SQL, however, would require additional levels of nested subqueries because the minimum/maximum specifications refer to different groupings and therefore require different GROUP BY clauses.
PROC SUMMARY shares with PROC SQL the advantages of doing everything in one step and not requiring sorted/indexed input data.
(I think the first date in the "desired output" in your initial post should read 2019-02-01.)
From your description, it looks as though the date in the first output record should rather be 2019-02-01, as it precedes 2019-03-01 for the same ID "A" and attribute "m1".
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.