BookmarkSubscribeRSS Feed
apple
Calcite | Level 5
HH_NoPI_NoX
01011
01025
02012
03013

With a dataset similar (but with many more rows) to the above, I want to select

for each HH_No

the PI_No  with the highest X.

The output should show the HH_No, PI_No with the highest X & the corresponding X.

Thanks

8 REPLIES 8
Keith
Obsidian | Level 7

data have;

input HH_No PI_No X;

format HH_No PI_No z2.;

datalines;

01 01 1

01 02 5

02 01 2

03 01 3

;

run;

proc summary data=have nway;

class HH_No;

output out=want (drop=_:) maxid(X(PI_No X))=;

run;

apple
Calcite | Level 5

Thanks for the reply. But I don't understand this line:

output out=want (drop=_:) maxid(X(PI_No X))=;

Specifically, I don't understand

_:

(X(PI_No X))

Why is it blank after the = in "maxid(X(PI_No X))=;"?

Thanks

Keith
Obsidian | Level 7

'maxid(X( PI_No X )) =' says to return PI_No and X for the highest value of X per group.


The = sign afterwards just enables new variable names to be created if desired,

e.g. maxid(X( PI_No X )) = newvar1 newvar2;


Leaving it blank retains the existing variables names (PI_No X)

pradeepalankar
Obsidian | Level 7

proc sql;

select * from have group by HH_No

having x=max(x);

quit;

Vince28_Statcan
Quartz | Level 8

The issue although it may also be the desired result as it was not specified, is that if you have 2 distinct PI_NO with the same x=max(x) value, you will output both with that proc sql approach.

Similarly, depending on the size of your data, using proc summary instead of a data step or sql solution could take far more time than it should.

Again, depending on the size of your data and whether it is already pre-sorted by HH_NO or not, there are hash solutions, by group processing with point= solution and others that could get you faster processing and solve the proc sql potential duplicate HH_NO issue.

So anyway if the solutions above run too slow or don't fully comply with your requirement, I will gladly provide either a hash or by processing solution but just in case you did not have such a huge dataset and either the above worked, I'll wait for a reply to provide an alternative solution.

apple
Calcite | Level 5

Hi

Thank you for explaining things. I am new to SAS and will like to know what is "a hash or by processing solution".

Thank you

PGStats
Opal | Level 21

A BY processing solution is a datastep where you take advantage of SAS automatic variables that signal the first and last observations of each BY group. It is very efficient but requires the input dataset to be sorted by the BY variables. For your simple task, it would look like this:

data have;
input HH_No PI_No X;
format HH_No PI_No z2.;
datalines;
01 01 1
01 02 5
02 01 2
03 01 3
;

proc sort data=have; by HH_No X; run; /* not required if dataset is already sorted */

data want;
set have; by HH_No;
if last.HH_No; /* keep last observation of each BY group */
run;

proc print data=want noobs; run;

PG

PG
Vince28_Statcan
Quartz | Level 8

As PG has examplified by-group processing, I will only answer the hash portion.

The SAS Hash Object is a fairly recent feature. It is essentially a "lookup table" that is usable through the data step. The strength of Hashing for the specific question you've asked is that if your data wasn't already sorted according to your by-group, then a hash solution alleviates the need to pre-sort (which often is the lenghtiest part of a simple by-group process). However, the hash object is fully held in memory throughout the data step so if you have a very, very large volume of data, it is possible to face an "out of memory" error with hash solutions.

Syntaxically, it also looks confusing at first glance but one gets familiar with the programing syntax quite fast. Since you mentioned you were fairly new to SAS, it is probably best to get accustomed at least to by-group processing before looking at hashing. Hashing can achieve far more than just this but since it is memory-limited, it is wiser to learn the processing tools that should work "all around". So I will refrain from providing a hash solution. There are plenty of similar examples around these forums anyway and in various SGF papers.

As to add one element to Keith's description above, procedures have various built-in output. Unless tasked otherwise, the output statement in proc summary outputs outputs many more statistics than just maximum. Using output statement options allows you to control that default behaviour. See proc summary -> output statement documentation (it will actually point you towards proc means' output statement as they have the same options). You will find the specific explanation for the MAXID option there and how it applies the max on X, by class HH_NO but will actually output PI_No and X as the identifier for that maximum point found.

Vincent

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1915 views
  • 1 like
  • 5 in conversation