BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
supp
Pyrite | Level 9

Hello SAS community,

 

I have come across this scenario a few times and don't feel I have found or created a good solution. Lets say we have data as such:

data have;
	infile datalines;
		input personID $ attribute1 $ qty  date yymmdd10.;
		format date yymmdd10.;
datalines;
A m1 10 2019-03-01
A m1 20 2019-03-01
A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 10 2019-02-15
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15
D m1 20 2019-02-15
;
run;

We can have multiple observations per person per attribute. We want to return one observation per person per attribute with the earliest date, min(date). In the event there are multiple observations for the same person and attribute then select for the largest qty, max(qty). 

 

Desired output:

A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15

For starters I tried this to return one row per person per attribute with the earliest date, but it doesn't work. As expected I get the note: NOTE: The query requires remerging summary statistics back with the original data. 

 

proc sql;
	/** return one row per person and attribute **/
	create table want1 as
		select *
		from work.have
		group by personID, attribute1
		having min(date);
quit;

Is there a simple solution for this that I am not seeing? Is SQL the best approach? Or is there a better approach using a data step?

 

Thanks in advance!

 

edit: fixed first obs in desired output

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Use a subquery:

 

proc sql;
create table want as
select *
from (select * from have group by personId, attribute1 having date = min(date))
group by personId, attribute1, date
having qty = max(qty);
quit;
PG

View solution in original post

18 REPLIES 18
Reeza
Super User

Easy and boring with a data step.

Sort by person, date and descending quantity so the maxiumum quantity is first with multiple dates and then take the first record of each person/attribute group.


EDIT: Forgot attribute, added that in. 

 

proc sort data=have;
by personID attribute date descending qt1;
run;

data want;
set have;
by personID attribute date;
if first.attribute;
run;

@supp wrote:

Hello SAS community,

 

I have come across this scenario a few times and don't feel I have found or created a good solution. Lets say we have data as such:

data have;
	infile datalines;
		input personID $ attribute1 $ qty  date yymmdd10.;
		format date yymmdd10.;
datalines;
A m1 10 2019-03-01
A m1 20 2019-03-01
A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 10 2019-02-15
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15
D m1 20 2019-02-15
;
run;

We can have multiple observations per person per attribute. We want to return one observation per person per attribute with the earliest date, min(date). In the event there are multiple observations for the same person and attribute then select for the largest qty, max(qty). 

 

Desired output:

A m1 20 2019-03-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 20 2019-02-15
C m2 20 2019-01-15
D m1 20 2019-01-15

For starters I tried this to return one row per person per attribute with the earliest date, but it doesn't work. As expected I get the note: NOTE: The query requires remerging summary statistics back with the original data. 

 

proc sql;
	/** return one row per person and attribute **/
	create table want1 as
		select *
		from work.have
		group by personID, attribute1
		having min(date);
quit;

Is there a simple solution for this that I am not seeing? Is SQL the best approach? Or is there a better approach using a data step?

 

Thanks in advance!


 

novinosrin
Tourmaline | Level 20

Hi @supp   The way you frame questions is very intriguing and nice. This time, i am just gonna correct your SQL and wait for others to respond with their innovative answers. I am gonna see if I can test myself i.e my own capability to come up with something that others haven't. But mind you, if somebody among the likes of PGstats, Tom, Paul D, Reinhard, John King, Xia Keshan et al etc has answered. I am just gonna take that notes. 🙂 and be in awe.

 

You missed an equality operator to filter the group having identified the MIN

 

proc sql;
	/** return one row per person and attribute **/
	create table want1 as
		select *
		from work.have
		group by personID, attribute1
		having min(date)=date;
quit;

 

 

 

 

supp
Pyrite | Level 9
Thanks for catching my syntax error. That actually helped a lot!
supp
Pyrite | Level 9

Hi @supp   The way you frame questions is very intriguing and nice.

@novinosrin , if you have any suggestions on how to improve asking these questions let me know! 

novinosrin
Tourmaline | Level 20

You are doing great. We love it!

PGStats
Opal | Level 21

Use a subquery:

 

proc sql;
create table want as
select *
from (select * from have group by personId, attribute1 having date = min(date))
group by personId, attribute1, date
having qty = max(qty);
quit;
PG
novinosrin
Tourmaline | Level 20

Does this work?

 

proc sql;
create table want as
 select *
	from work.have
     group by personID, attribute1
	having max(qty)=qty and min(date)=date;
quit;
PGStats
Opal | Level 21

@novinosrin This can't always work because the max(qty) might not occur on the first date.

PG
novinosrin
Tourmaline | Level 20

Thank you Sir @PGStats  Great catch 

Reeza
Super User
It depends on the criteria. I read that if records with multiple minimum dates the first time I coded it, or it could be the max if multiple overall which does change the question a bit.
supp
Pyrite | Level 9

@PGStats, you are correct. @novinosrin, I think the solution worked in the example I provided because all my scenarios had the max qty on the earliest date per person per attribute. I added person E to the data such that the max qty is not on the earliest date:

data have;
	infile datalines;
		input personID $ attribute1 $ qty  date yymmdd10.;
		format date yymmdd10.;
datalines;
A m1 10 2019-03-01
A m1 20 2019-03-01
A m1 20 2019-02-01
A m2 20 2019-01-01
A m3 20 2019-02-01
B m3 10 2019-02-15
B m3 20 2019-02-15
C m2 10 2019-01-15
D m1 20 2019-01-15
D m1 20 2019-02-15
E m2 30 2019-02-01
E m2 20 2019-01-01
;
run;


proc sql;
	/** return one row per person and attribute **/
	create table want1 as
		select *
		from work.have
		group by personID, attribute1
		having min(date)=date and qty=max(qty);
quit;

 

We don't return any observations for person E. Want to return the observation with the earliest date:

A	m1	20	2019-02-01
A	m2	20	2019-01-01
A	m3	20	2019-02-01
B	m3	20	2019-02-15
C	m2	10	2019-01-15
D	m1	20	2019-01-15
novinosrin
Tourmaline | Level 20

That's the reason i don't like PG because he is Prodigygeniusstats aka pgstats 🙂

FreelanceReinh
Jade | Level 19

Hi @supp,

 

A third competitor besides PROC SQL (using GROUP BY, HAVING and nested subqueries) and the combination of PROC SORT and a DATA step (using BY and FIRST.by-variable or LAST.by-variable) is PROC SUMMARY (or its twin PROC MEANS):

proc summary data=have nway;
class personID attribute1;
output out=want(drop=_:) idgrp(min(date) max(qty) out (date qty)=);
run;

If the input dataset is sorted by the grouping variables, which is the case in your sample data, you can replace the CLASS statement with a BY statement and omit the NWAY option:

proc summary data=have;
by personID attribute1;
output out=want(drop=_:) idgrp(min(date) max(qty) out (date qty)=);
run;

I always have difficulties remembering the syntax of the IDGROUP (alias IDGRP) specification, but actually it's quite logical: Within the BY group (or combination of CLASS variable values) take the min(date) and, if there are ties, use max(qty) as a tiebreaker (and if there are still ties left, use the first of the tied observations in the input dataset -- or the last if the keyword LAST is inserted before OUT). The list of ID variables, (date qty), could include more variables if desired (e.g. all remaining variables of HAVE, if any).

 

Like the PROC SORT/DATA step approach PROC SUMMARY guarantees that only one observation per group will be included in the output dataset (see tiebreaking rules), whereas the HAVING clause of PROC SQL is prone to include tied observations (which may or may not be desirable, depending on the task).

 

It's also easily extendable to more than just two ID variables (date, qty). PROC SQL, however, would require additional levels of nested subqueries because the minimum/maximum specifications refer to different groupings and therefore require different GROUP BY clauses.

 

PROC SUMMARY shares with PROC SQL the advantages of doing everything in one step and not requiring sorted/indexed input data. 

 

(I think the first date in the "desired output" in your initial post should read 2019-02-01.)

hashman
Ammonite | Level 13

@supp:

From your description, it looks as though the date in the first output record should rather be 2019-02-01, as it precedes 2019-03-01 for the same ID "A" and attribute "m1". 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 18 replies
  • 3608 views
  • 12 likes
  • 6 in conversation