My dataset is shown as below,
case_number | case_value |
a | 1 |
a | 2 |
a | 3 |
a | 4 |
b | 2 |
b | 3 |
b | 4 |
b | 5 |
c | 5 |
c | 6 |
c | 7 |
c | 9 |
I want to extract the observations whose case _value is maximal across each case_number. For this dataset, I want to get one subset like this.
case_number | case_value |
a | 4 |
b | 5 |
c | 9 |
How to achieve this by coding?
Thanks!
data have;
input case_number $ case_value;
datalines;
a 1
a 2
a 3
a 4
b 2
b 3
b 4
b 5
c 5
c 6
c 7
c 9
;
run;
proc sql;
create table want as
select distinct case_number,
max(case_value) as case_value
from have
group by case_number;
quit;
Why use DISTINCT in this query?
PG
Thanks for the question since it made me think what's happening with proc sql calls.
It will return the same result set without DISTINCT in select statement with one less operation.
I looked at the method chosen by proc sql with and without DISTINCT keyword in select statement.
When DISTINCT is used "sqxunqs" operation is run as an extra operation to de-dup records which are already unique before this operation.
I should be more careful next time when using DISTINCT keyword especially with large data sets.
Method without DISTINCT:
proc sql _method;
create table want as
select case_number,
max(case_value) as case_value
from have
group by case_number;
quit;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxsumg
sqxsort
sqxsrc( WORK.HAVE )
Method with DISTINCT:
proc sql _method;
create table want as
select distinct case_number,
max(case_value) as case_value
from have
group by case_number;
quit;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxunqs
sqxsumg
sqxsort
sqxsrc( WORK.HAVE )
Another option is Proc Summary. The 2nd example is based on the maximum value being returned when an ID statement is used.
proc summary data=have nway;
class case_number;
var case_value;
output out=want (drop=_:) max=;
run;
or
proc summary data=have nway;
class case_number;
id case_value;
output out=want (drop=_:);
run;
or no SQL solution:
data have;
input case_number $ case_value;
datalines;
a 1
a 2
a 3
a 4
b 2
b 3
b 4
b 5
c 5
c 6
c 7
c 9
;
proc sort;
by case_number $ case_value;
data want;
set have;
by case_number;
if last.case_number;
proc print;run;
Linlin
hi ... the SQL and SUMMARY solutions seem most "economical", but if you are going to sort, you can always sort again and take advantage of the default EQUALS option that maintains order withing sort groups ...
proc sort data=have out=want;
by case_number descending case_value;
run;
proc sort data=want nodupkey;
by case_number;
run;
Hi Mike,
Thank you! I need to practice my Mapping skill learned from you. I am going to post a question at SAS/Graph and ODS Graphics. I am sure you will help me.
- Linlin
FWIW, a hash() solution:
data have;
input case_number$ case_value;
cards;
a 1
a 2
a 3
a 4
b 2
b 3
b 4
b 5
c 5
c 6
c 7
c 9
;
data _null_;
if 0 then set have;
dcl hash h(dataset:'have', ordered: 'a',duplicate:'r');
h.definekey('case_number');
h.definedata(all:'y');
h.definedone();
rc=h.output(dataset:'want');
stop;
run;
proc print;run;
Haikuo
Hi Haikuo,
I am totally new to hash. Hash looks like a vey handy appraoch.
Coud you please elaborate a bit how you have built the logic in your hash code (time permitting).
This simple data set looks a good opportunity to understand it.
Thanks
Mirisage
Hi Mirisage,
Hash() object is actually more straightforward than many other data step statements, such as 'set', 'merge', 'update', it is just a little different. And you are correct, Above example is ideal to take a peek into hash():
data _null_;
if 0 then set have; /*This is to set hash() environment, eg variables, before you can define a hash(), in this case, you can also use: call missing (case_number, case_value);*/
dcl hash h(dataset:'have', ordered: 'a',duplicate:'r'); /*This is to define hash(), named 'h', loaded from 'have', with an order 'a' as 'ascending', if there are duplicates by 'key', then only keep the last record ('r' denotes replace)*/
h.definekey('case_number'); /*Define search key 'case_number'*/
h.definedata(all:'y'); /*Define data element, all :'y' denoting with every thing from 'have'
h.definedone(); /*the end of hash() definition*/
rc=h.output(dataset:'want'); /output hash 'h' to dataset 'want'*/
stop;
run;
Here is the help link, where I 've learned most of my hash():
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002576871.htm
Also there are tons of sugi paper on hash(), pay extra attention on those written by Paul dorfman.
Regards,
Haikuo
Hi Haikuo,
Thank you very much for your time and expertise.
Learning resources and your annotated comments are very much appreciated.
Best regards
Mirisage
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.