Calcite | Level 5

How to keep specific rows, by frequency

Hello all, I am trying to write a code where I take my data, put it into a table, and then just make a table for the top 5 and bottom 5 values by frequency.

My code is:

``````Proc Freq data=db1;
table statefips / nocum;
run;

Proc Sort data=db1;
by statefips;
run;``````

My Table

StateCode Frequency

 26357 1.47 4326 0.24 35139 1.97 15870 0.89 205357 11.49 31240 1.75 22006 1.23 5115 0.29 3871 0.22 106730 5.97 54726 3.06 8187 0.46 9065 0.51 75141 4.2 38683 2.16 19039 1.06 17154 0.96 24458 1.37 24895 1.39 7944 0.44 35711 2 40991 2.29 56479 3.16 33679 1.88 15439 0.86 35793 2 6046 0.34 11315 0.63 15939 0.89 8566 0.48 51896 2.9 10674 0.6 110102 6.16 55570 3.11 4396 0.25 68905 3.85 21246 1.19 22879 1.28 74577 4.17 6383 0.36 26927 1.51 5100 0.29 36552 2.04 139158 7.78 15825 0.89 4237 0.24 48324 2.7 40337 2.26 9658 0.54 36436 2.04 3579 0.2

I really never learned how to pull specific values out of a table to keep, so I'm a little stumped. If anyone can help me with this, I would greatly appreciate it.

2 REPLIES 2
Opal | Level 21

Re: How to keep specific rows, by frequency

If your table is sorted by FREQ then just picking the top 3 and bottom 3 rows is not that hard. But.... how do you intend to deal with ties?

In below data would you want to pick the row where category="D"?

``````data have;
input category \$ freq;
datalines;
A 10
B 9
C 9
D 8
E 7
F 6
G 5
H 4
I 3
;``````

Using above data the code to just pick the first and last 3 rows could look like below.

``````proc sort data=have;
by descending freq;
run;

data want;
set have nobs=nrows;
by descending freq;
if _n_ <=3 or _n_>nrows-3;
run;

proc print data=want;
run;``````

As you can see the row with the 3rd highest value hasn't been selected because it was row 4. ...and if you would have 3 rows with a 2nd highest value then even one of these wouldn't get selected.

Proc Rank can help you dealing with ties. In below code code the rows with the three highest and three lowest values for Freq will get selected.

``````proc rank data=have out=inter ties=dense descending;
var freq;
ranks ranking;
run;
proc sql;
/*  create table want as*/
select *
from inter
having ranking<=3 or ranking>=max(ranking)-2
;
quit;``````

PROC Star

Re: How to keep specific rows, by frequency

You might have ties, so this code keeps tied values at each extreme, as long as at least the 3 lowest and 3 highest observations are output:

``````
proc sort data=have;
by descending freq;
run;

%let min_needed=3;  /* At least 3 highest and 3 lowest, including ties */

data want;
set have nobs=nrows;
by descending freq;
output;

/* If first 3 have been output then stop when ties are exhausted,
then do the same from the bottom up */
if _n_>=&min_needed and last.freq=1 then do p=nrows by -1;
set have point=p;
if p<nrows-&min_needed and freq^=lag(freq) then stop;
output;
end;
run;

proc print data=want;
run;
``````

In the case of a very large dataset, this can be much faster because it doesn't pass through the middle observations.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Discussion stats
• 2 replies
• 288 views
• 0 likes
• 3 in conversation