## Top 5 Values of Categorical Variable

Occasional Contributor
Posts: 13

# Top 5 Values of Categorical Variable

[ Edited ]

I have a dataset containing ID numbers, university number, and major for students in a certain program.  I need to find the top 5 majors over all the schools and then the corresponding number and % for each of the top 5 majors (overall) by school number.

For example:

 Top 5 majors for students in program Students in program Major #1 Major #2 Major #3 Major #4 Major #5 N N % N % N % N % N % Overall School 1 School 2 School 3 School 4 School 5 School 6 . . .

I am having a hard time due to major being a categorical variable.

I have this code, but it doesn't give me what I am looking for:

proc freq data = have order = freq;

table univNum * Major / noprint out = majors;

run;

proc sort data = majors;

by univNum descending COUNT;

run;

proc means data = majors;

class univNum;

var COUNT;

output out = top5list sum =

idgroup(max(COUNT) out[5] (COUNT)=);

run;

Any advice would be much appreciated.

Super User
Posts: 23,724

## Re: Top 5 Values of Categorical Variable

Isn't your logic missing a step?

You would first need to find the top 5 overall and then filter the percent at each school. It may not be the same...

1. Proc freq for overall - by subject only.

2. Proc freq by subject and school - for correct N and %

3. Filter table 2 by top 5 from step 1

Posts: 4,736

## Re: Top 5 Values of Categorical Variable

To find the overall top5 you will also have to consider how to deal with ties.

Proc Rank gives you options for this so may be use this procedure to determine the overall top5 majors.

Super User
Posts: 13,542

## Re: Top 5 Values of Categorical Variable

And to throw a potential complication is the Percent you are going to report the percent of all majors or the percent of the major among the top 5? I'm not sure if your "overall" means the denominator is total enrollment in the school or not. Careful definition of numerator and denominator in percents really helps.

And would that school enrollment include masters and above enrolled students? Since your top 5 are very likely to be bachelor's degree programs should the masters / doctoral levels be included?

SAS Super FREQ
Posts: 4,241

## Re: Top 5 Values of Categorical Variable

See this blog post. The main trick is to use PROC FREQ with the ORDER=FREQ option. Then just take the first 5 observations  for  each school.

Here's an example to get  you started. HTH.

``````proc freq data=sashelp.cars ORDER=FREQ noprint;
tables origin*type / out=FreqOut;
run;
proc sort data=FreqOut;
by Origin  descending Count;
run;

data Top3;
set FreqOut;
by origin;
if first.origin then
cnt = 1;
else cnt + 1;
if cnt <= 3; /* top 3 */
run;``````

Occasional Contributor
Posts: 13

## Re: Top 5 Values of Categorical Variable

[ Edited ]

I follow your logic which is very helpful, but then my proc tabulate does not give me the proper table I mocked up in my original question.  If anyone has any advice I would appreciate it.

In response to other questions, I want to find the top 5 majors over all universities, then the count of those top 5 majors for each university (undergrad only).

```proc freq data = work.All ORDER = FREQ noprint;
tables Major / out = FreqOut (rename = (count = studentsinmaj));
run;

proc sort data = FreqOut;
by descending studentsinmaj;
run;

*top 5 overall;
data top5;
set FreqOut;
by descending studentsinmaj;
univNum = 1; *want this for overall summary of all universities;
if first.Major then
Cnt = 1;
else Cnt + 1;
if Cnt <= 5;
run;

proc sort data = top5;
by univNum descending studentsinmaj;
run;

*filter school numbers by the top 5 overall languages;
proc freq data = work.All ORDER = FREQ noprint;
tables univNum * Major / out = test (rename = (count = studentsinmaj));
where Major in ('Top major #1 by count on top5', 'Top major #2 by count on top5', 'Top major #3 by count on top5', 'Top major #4 by count on top5');
run;

proc sort data = test;
by univNum descending studentsinmaj;
run;

data combined;
set top5 test;
by univNum descending studentsinmaj;
run;

ODS HTML BODY = 'C:\SAS\primmaj.xls'
style = STATDOC;
proc tabulate data = combined format = 6.1;
class univnum major / missing order = formatted;
var studentsinmaj;
tables (univnum = ''), N ((major = '' * studentsinmaj = '') * (N RowPctN = '%')) / Printmiss;
run;
ODS HTML CLOSE;```

Super User
Posts: 23,724

## Re: Top 5 Values of Categorical Variable

If you post some sample data perhaps we can help with the code.

Occasional Contributor
Posts: 13

## Re: Top 5 Values of Categorical Variable

 StudentNum UnivNum Major 765473521 15 Art 534651321 15 Biology 984651316 15 Biology 534545645 15 Computer Science 145645012 15 Computer Science 645124645 15 Computer Science 878782245 15 Computer Science 456465678 15 Computer Science 347568651 15 Computer Science 789213425 15 Computer Science 243527654 15 Computer Science 156456321 15 History 645541651 15 History 215465654 15 History 126545465 15 History 899875654 15 History 554545111 15 History 545645655 15 Math 651321654 15 Math 985432135 15 Mechanical Engineering 654321354 15 Mechanical Engineering 687613521 15 Mechanical Engineering 765473521 90 Biology 534651321 90 Biology 984651316 90 Biology 534545645 90 Computer Science 145645012 90 Computer Science 645124645 90 Computer Science 878782245 90 Computer Science 456465678 90 Computer Science 347568651 90 Computer Science 789213425 90 History 243527654 90 History 156456321 90 History 645541651 90 History 215465654 90 History 126545465 90 History 899875654 90 History 554545111 90 History 545645655 90 Math 651321654 90 Math 985432135 90 Math 654321354 90 Sociology 687613521 90 Sociology 765473521 645 Chemistry 534651321 645 Chemistry 984651316 645 Chemistry 534545645 645 Computer Science 145645012 645 Computer Science 645124645 645 Computer Science 878782245 645 Computer Science 456465678 645 Electrical Engineering 347568651 645 Electrical Engineering 789213425 645 Electrical Engineering 243527654 645 Electrical Engineering 156456321 645 Electrical Engineering 645541651 645 Electrical Engineering 215465654 645 History 126545465 645 History 899875654 645 History 554545111 645 History 545645655 645 Math 651321654 645 Math 985432135 645 Mechanical Engineering 654321354 645 Mechanical Engineering 687613521 645 Mechanical Engineering
Super User
Posts: 23,724

## Re: Top 5 Values of Categorical Variable

If I understand your problem correctly, I don't think you can use PROC TABULATE since you want to filter the results but still have the percentages reflect the percentages from the University. I think this calculates your data as desired, but doesn't prepare the final table. Hope it helps.

``````data have;
input StudentNum	UnivNum	Major \$50. ;
cards;
765473521	15	Art
534651321	15	Biology
984651316	15	Biology
534545645	15	Computer Science
145645012	15	Computer Science
645124645	15	Computer Science
878782245	15	Computer Science
456465678	15	Computer Science
347568651	15	Computer Science
789213425	15	Computer Science
243527654	15	Computer Science
156456321	15	History
645541651	15	History
215465654	15	History
126545465	15	History
899875654	15	History
554545111	15	History
545645655	15	Math
651321654	15	Math
985432135	15	Mechanical Engineering
654321354	15	Mechanical Engineering
687613521	15	Mechanical Engineering
765473521	90	Biology
534651321	90	Biology
984651316	90	Biology
534545645	90	Computer Science
145645012	90	Computer Science
645124645	90	Computer Science
878782245	90	Computer Science
456465678	90	Computer Science
347568651	90	Computer Science
789213425	90	History
243527654	90	History
156456321	90	History
645541651	90	History
215465654	90	History
126545465	90	History
899875654	90	History
554545111	90	History
545645655	90	Math
651321654	90	Math
985432135	90	Math
654321354	90	Sociology
687613521	90	Sociology
765473521	645	Chemistry
534651321	645	Chemistry
984651316	645	Chemistry
534545645	645	Computer Science
145645012	645	Computer Science
645124645	645	Computer Science
878782245	645	Computer Science
456465678	645	Electrical Engineering
347568651	645	Electrical Engineering
789213425	645	Electrical Engineering
243527654	645	Electrical Engineering
156456321	645	Electrical Engineering
645541651	645	Electrical Engineering
215465654	645	History
126545465	645	History
899875654	645	History
554545111	645	History
545645655	645	Math
651321654	645	Math
985432135	645	Mechanical Engineering
654321354	645	Mechanical Engineering
687613521	645	Mechanical Engineering
;
run;

proc freq data=have order=freq noprint;
table major/out=summary_overall;
run;

data top5;
set summary_overall(obs=5);
run;

proc freq data=have noprint ;
table univnum*major/out=summary_uni (drop=pct_col) outpct ;
run;

proc sql;
create table want as
select *
from summary_uni
where major in (select major from top5);
quit;

``````
Occasional Contributor
Posts: 13

## Re: Top 5 Values of Categorical Variable

Thank you for this helpful code.  Any suggestions for getting the table I want, with the top 5 majors overall and then the counts and percents of those 5 for each school?  I am new to creating Excel tables in SAS.  I have been trying to learn Proc Tabulate.

Super User
Posts: 23,724

## Re: Top 5 Values of Categorical Variable

The WANT table has this information, I'm assuming that all Majors are offered at all Universities and I'm not sure this is true. You may need to account for 0.

To get the table you want, tranpsose the WANT table and use PROC PRINT on the results. You'll need to basically create the table on your own and then display it.

PROC TABULATE is not suitable for this application, as the summary stats you want will not be calculated appropriately. The percents calculated are for all the data used but since you want a subset the values will not be correct.

Occasional Contributor
Posts: 13

## Re: Top 5 Values of Categorical Variable

The problem is I want univNums to go down the left side, and the top 5 languages across the top like this:

 Top 5 majors for students in program Students in program Major #1 Major #2 Major #3 Major #4 Major #5 N N % N % N % N % N % Overall School 1 School 2 School 3 School 4 School 5 School 6 . . .
Posts: 4,736

## Re: Top 5 Values of Categorical Variable

Here something which gets close to your layout using Proc Tabulate.

Data used in the following code as generated by the code @Reeza posted.

``````
proc sql noprint;
select distinct cats("'",major,"'") into :major_list separated by ','
from top5
;
quit;

options missing=' ';
proc tabulate
data=have(where=(major in (&major_list)))
format=F6.0
;
class  UnivNum Major;
table
all='Overall' UnivNum=' ',
all='Students in program'*n*f=f16.0  Major=" "*(n*f=F6.0 pctn<Major>='%'*f=F6.2)
/rts=15 box='Top 5 majors for students in program'
;
quit;``````
Super User
Posts: 23,724

## Re: Top 5 Values of Categorical Variable

Proc report + a dataset should work. You may even be able to get it from proc report, but I'm not very good with those procs.

Here's a paper that details this type of reporting.

http://www2.sas.com/proceedings/forum2008/173-2008.pdf

Note the code is available for this paper see the last page for the link.

Discussion stats
• 13 replies
• 312 views
• 0 likes
• 5 in conversation