Select observations that meet a criteria

Reply
Frequent Contributor
Posts: 75

Select observations that meet a criteria

HH_NoPI_NoX
01011
01025
02012
03013

With a dataset similar (but with many more rows) to the above, I want to select

for each HH_No

the PI_No  with the highest X.

The output should show the HH_No, PI_No with the highest X & the corresponding X.

Thanks

Regular Contributor
Posts: 151

Re: Select observations that meet a criteria

data have;

input HH_No PI_No X;

format HH_No PI_No z2.;

datalines;

01 01 1

01 02 5

02 01 2

03 01 3

;

run;

proc summary data=have nway;

class HH_No;

output out=want (drop=_Smiley Happy maxid(X(PI_No X))=;

run;

Frequent Contributor
Posts: 75

Re: Select observations that meet a criteria

Thanks for the reply. But I don't understand this line:

output out=want (drop=_Smiley Happy maxid(X(PI_No X))=;

Specifically, I don't understand

_:

(X(PI_No X))

Why is it blank after the = in "maxid(X(PI_No X))=;"?

Thanks

Regular Contributor
Posts: 151

Re: Select observations that meet a criteria

'maxid(X( PI_No X )) =' says to return PI_No and X for the highest value of X per group.


The = sign afterwards just enables new variable names to be created if desired,

e.g. maxid(X( PI_No X )) = newvar1 newvar2;


Leaving it blank retains the existing variables names (PI_No X)

Frequent Contributor
Posts: 106

Re: Select observations that meet a criteria

proc sql;

select * from have group by HH_No

having x=max(x);

quit;

Super Contributor
Posts: 339

Re: Select observations that meet a criteria

Posted in reply to pradeepalankar

The issue although it may also be the desired result as it was not specified, is that if you have 2 distinct PI_NO with the same x=max(x) value, you will output both with that proc sql approach.

Similarly, depending on the size of your data, using proc summary instead of a data step or sql solution could take far more time than it should.

Again, depending on the size of your data and whether it is already pre-sorted by HH_NO or not, there are hash solutions, by group processing with point= solution and others that could get you faster processing and solve the proc sql potential duplicate HH_NO issue.

So anyway if the solutions above run too slow or don't fully comply with your requirement, I will gladly provide either a hash or by processing solution but just in case you did not have such a huge dataset and either the above worked, I'll wait for a reply to provide an alternative solution.

Frequent Contributor
Posts: 75

Re: Select observations that meet a criteria

Posted in reply to Vince28_Statcan

Hi

Thank you for explaining things. I am new to SAS and will like to know what is "a hash or by processing solution".

Thank you

Respected Advisor
Posts: 4,925

Re: Select observations that meet a criteria

A BY processing solution is a datastep where you take advantage of SAS automatic variables that signal the first and last observations of each BY group. It is very efficient but requires the input dataset to be sorted by the BY variables. For your simple task, it would look like this:

data have;
input HH_No PI_No X;
format HH_No PI_No z2.;
datalines;
01 01 1
01 02 5
02 01 2
03 01 3
;

proc sort data=have; by HH_No X; run; /* not required if dataset is already sorted */

data want;
set have; by HH_No;
if last.HH_No; /* keep last observation of each BY group */
run;

proc print data=want noobs; run;

PG

PG
Super Contributor
Posts: 339

Re: Select observations that meet a criteria

As PG has examplified by-group processing, I will only answer the hash portion.

The SAS Hash Object is a fairly recent feature. It is essentially a "lookup table" that is usable through the data step. The strength of Hashing for the specific question you've asked is that if your data wasn't already sorted according to your by-group, then a hash solution alleviates the need to pre-sort (which often is the lenghtiest part of a simple by-group process). However, the hash object is fully held in memory throughout the data step so if you have a very, very large volume of data, it is possible to face an "out of memory" error with hash solutions.

Syntaxically, it also looks confusing at first glance but one gets familiar with the programing syntax quite fast. Since you mentioned you were fairly new to SAS, it is probably best to get accustomed at least to by-group processing before looking at hashing. Hashing can achieve far more than just this but since it is memory-limited, it is wiser to learn the processing tools that should work "all around". So I will refrain from providing a hash solution. There are plenty of similar examples around these forums anyway and in various SGF papers.

As to add one element to Keith's description above, procedures have various built-in output. Unless tasked otherwise, the output statement in proc summary outputs outputs many more statistics than just maximum. Using output statement options allows you to control that default behaviour. See proc summary -> output statement documentation (it will actually point you towards proc means' output statement as they have the same options). You will find the specific explanation for the MAXID option there and how it applies the max on X, by class HH_NO but will actually output PI_No and X as the identifier for that maximum point found.

Vincent

Ask a Question
Discussion stats
  • 8 replies
  • 437 views
  • 1 like
  • 5 in conversation