HH_No | PI_No | X |
---|---|---|
01 | 01 | 1 |
01 | 02 | 5 |
02 | 01 | 2 |
03 | 01 | 3 |
With a dataset similar (but with many more rows) to the above, I want to select
for each HH_No
the PI_No with the highest X.
The output should show the HH_No, PI_No with the highest X & the corresponding X.
Thanks
data have;
input HH_No PI_No X;
format HH_No PI_No z2.;
datalines;
01 01 1
01 02 5
02 01 2
03 01 3
;
run;
proc summary data=have nway;
class HH_No;
output out=want (drop=_:) maxid(X(PI_No X))=;
run;
Thanks for the reply. But I don't understand this line:
output out=want (drop=_:) maxid(X(PI_No X))=;
Specifically, I don't understand
_:
(X(PI_No X))
Why is it blank after the = in "maxid(X(PI_No X))=;"?
Thanks
'maxid(X( PI_No X )) =' says to return PI_No and X for the highest value of X per group.
The = sign afterwards just enables new variable names to be created if desired,
e.g. maxid(X( PI_No X )) = newvar1 newvar2;
Leaving it blank retains the existing variables names (PI_No X)
proc sql;
select * from have group by HH_No
having x=max(x);
quit;
The issue although it may also be the desired result as it was not specified, is that if you have 2 distinct PI_NO with the same x=max(x) value, you will output both with that proc sql approach.
Similarly, depending on the size of your data, using proc summary instead of a data step or sql solution could take far more time than it should.
Again, depending on the size of your data and whether it is already pre-sorted by HH_NO or not, there are hash solutions, by group processing with point= solution and others that could get you faster processing and solve the proc sql potential duplicate HH_NO issue.
So anyway if the solutions above run too slow or don't fully comply with your requirement, I will gladly provide either a hash or by processing solution but just in case you did not have such a huge dataset and either the above worked, I'll wait for a reply to provide an alternative solution.
Hi
Thank you for explaining things. I am new to SAS and will like to know what is "a hash or by processing solution".
Thank you
A BY processing solution is a datastep where you take advantage of SAS automatic variables that signal the first and last observations of each BY group. It is very efficient but requires the input dataset to be sorted by the BY variables. For your simple task, it would look like this:
data have;
input HH_No PI_No X;
format HH_No PI_No z2.;
datalines;
01 01 1
01 02 5
02 01 2
03 01 3
;
proc sort data=have; by HH_No X; run; /* not required if dataset is already sorted */
data want;
set have; by HH_No;
if last.HH_No; /* keep last observation of each BY group */
run;
proc print data=want noobs; run;
PG
As PG has examplified by-group processing, I will only answer the hash portion.
The SAS Hash Object is a fairly recent feature. It is essentially a "lookup table" that is usable through the data step. The strength of Hashing for the specific question you've asked is that if your data wasn't already sorted according to your by-group, then a hash solution alleviates the need to pre-sort (which often is the lenghtiest part of a simple by-group process). However, the hash object is fully held in memory throughout the data step so if you have a very, very large volume of data, it is possible to face an "out of memory" error with hash solutions.
Syntaxically, it also looks confusing at first glance but one gets familiar with the programing syntax quite fast. Since you mentioned you were fairly new to SAS, it is probably best to get accustomed at least to by-group processing before looking at hashing. Hashing can achieve far more than just this but since it is memory-limited, it is wiser to learn the processing tools that should work "all around". So I will refrain from providing a hash solution. There are plenty of similar examples around these forums anyway and in various SGF papers.
As to add one element to Keith's description above, procedures have various built-in output. Unless tasked otherwise, the output statement in proc summary outputs outputs many more statistics than just maximum. Using output statement options allows you to control that default behaviour. See proc summary -> output statement documentation (it will actually point you towards proc means' output statement as they have the same options). You will find the specific explanation for the MAXID option there and how it applies the max on X, by class HH_NO but will actually output PI_No and X as the identifier for that maximum point found.
Vincent
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.