Re: Select observations that meet a criteria

apple · Posted 01-03-2014 04:27 AM

HH_No	PI_No	X
01	01	1
01	02	5
02	01	2
03	01	3

With a dataset similar (but with many more rows) to the above, I want to select

for each HH_No

the PI_No with the highest X.

The output should show the HH_No, PI_No with the highest X & the corresponding X.

Thanks

Keith · Posted 01-03-2014 05:28 AM

data have;

input HH_No PI_No X;

format HH_No PI_No z2.;

datalines;

01 01 1

01 02 5

02 01 2

03 01 3

;

run;

proc summary data=have nway;

class HH_No;

output out=want (drop=_:) maxid(X(PI_No X))=;

run;

apple · Posted 01-04-2014 03:31 AM

Thanks for the reply. But I don't understand this line:

output out=want (drop=_:) maxid(X(PI_No X))=;

Specifically, I don't understand

_:

(X(PI_No X))

Why is it blank after the = in "maxid(X(PI_No X))=;"?

Thanks

Keith · Posted 01-06-2014 11:08 AM

'maxid(X( PI_No X )) =' says to return PI_No and X for the highest value of X per group.

The = sign afterwards just enables new variable names to be created if desired,

e.g. maxid(X( PI_No X )) = newvar1 newvar2;

Leaving it blank retains the existing variables names (PI_No X)

pradeepalankar · Posted 01-03-2014 08:42 AM

proc sql;

select * from have group by HH_No

having x=max(x);

quit;

Vince28_Statcan · Posted 01-03-2014 09:13 AM

The issue although it may also be the desired result as it was not specified, is that if you have 2 distinct PI_NO with the same x=max(x) value, you will output both with that proc sql approach.

Similarly, depending on the size of your data, using proc summary instead of a data step or sql solution could take far more time than it should.

Again, depending on the size of your data and whether it is already pre-sorted by HH_NO or not, there are hash solutions, by group processing with point= solution and others that could get you faster processing and solve the proc sql potential duplicate HH_NO issue.

So anyway if the solutions above run too slow or don't fully comply with your requirement, I will gladly provide either a hash or by processing solution but just in case you did not have such a huge dataset and either the above worked, I'll wait for a reply to provide an alternative solution.

apple · Posted 01-04-2014 03:33 AM

Hi

Thank you for explaining things. I am new to SAS and will like to know what is "a hash or by processing solution".

Thank you

PGStats · Posted 01-04-2014 10:45 PM

A BY processing solution is a datastep where you take advantage of SAS automatic variables that signal the first and last observations of each BY group. It is very efficient but requires the input dataset to be sorted by the BY variables. For your simple task, it would look like this:

data have;
input HH_No PI_No X;
format HH_No PI_No z2.;
datalines;
01 01 1
01 02 5
02 01 2
03 01 3
;

proc sort data=have; by HH_No X; run; /* not required if dataset is already sorted */

data want;
set have; by HH_No;
if last.HH_No; /* keep last observation of each BY group */
run;

proc print data=want noobs; run;

PG

Vince28_Statcan · Posted 01-06-2014 12:32 PM

As PG has examplified by-group processing, I will only answer the hash portion.

The SAS Hash Object is a fairly recent feature. It is essentially a "lookup table" that is usable through the data step. The strength of Hashing for the specific question you've asked is that if your data wasn't already sorted according to your by-group, then a hash solution alleviates the need to pre-sort (which often is the lenghtiest part of a simple by-group process). However, the hash object is fully held in memory throughout the data step so if you have a very, very large volume of data, it is possible to face an "out of memory" error with hash solutions.

Syntaxically, it also looks confusing at first glance but one gets familiar with the programing syntax quite fast. Since you mentioned you were fairly new to SAS, it is probably best to get accustomed at least to by-group processing before looking at hashing. Hashing can achieve far more than just this but since it is memory-limited, it is wiser to learn the processing tools that should work "all around". So I will refrain from providing a hash solution. There are plenty of similar examples around these forums anyway and in various SGF papers.

As to add one element to Keith's description above, procedures have various built-in output. Unless tasked otherwise, the output statement in proc summary outputs outputs many more statistics than just maximum. Using output statement options allows you to control that default behaviour. See proc summary -> output statement documentation (it will actually point you towards proc means' output statement as they have the same options). You will find the specific explanation for the MAXID option there and how it applies the max on X, by class HH_NO but will actually output PI_No and X as the identifier for that maximum point found.

Vincent

Classroom Training Available!