Top 5 Values of Categorical Variable

Reply
Occasional Contributor
Posts: 10

Top 5 Values of Categorical Variable

[ Edited ]

I have a dataset containing ID numbers, university number, and major for students in a certain program.  I need to find the top 5 majors over all the schools and then the corresponding number and % for each of the top 5 majors (overall) by school number.

 

For example:

Top 5 majors for students in program                      
  Students in program Major #1 Major #2 Major #3 Major #4 Major #5
  N N % N % N % N % N %
Overall                      
School 1                      
School 2                      
School 3                      
School 4                      
School 5                      
School 6                      
.                      
.                      
.                      

 

I am having a hard time due to major being a categorical variable.

 

I have this code, but it doesn't give me what I am looking for:

 

proc freq data = have order = freq;

table univNum * Major / noprint out = majors;

run;

 

proc sort data = majors;

by univNum descending COUNT;

run;

 

proc means data = majors;

class univNum;

var COUNT;

output out = top5list sum = 

idgroup(max(COUNT) out[5] (COUNT)=);

run;

 

Any advice would be much appreciated.

Super User
Posts: 19,772

Re: Top 5 Values of Categorical Variable

Posted in reply to missmeliss22

Isn't your logic missing a step? 

You would first need to find the top 5 overall and then filter the percent at each school. It may not be the same...

 

1. Proc freq for overall - by subject only. 

2. Proc freq by subject and school - for correct N and %

3. Filter table 2 by top 5 from step 1

 

 

Respected Advisor
Posts: 4,173

Re: Top 5 Values of Categorical Variable

Posted in reply to missmeliss22

To find the overall top5 you will also have to consider how to deal with ties.

Proc Rank gives you options for this so may be use this procedure to determine the overall top5 majors.

Super User
Posts: 11,343

Re: Top 5 Values of Categorical Variable

Posted in reply to missmeliss22

And to throw a potential complication is the Percent you are going to report the percent of all majors or the percent of the major among the top 5? I'm not sure if your "overall" means the denominator is total enrollment in the school or not. Careful definition of numerator and denominator in percents really helps.

And would that school enrollment include masters and above enrolled students? Since your top 5 are very likely to be bachelor's degree programs should the masters / doctoral levels be included?

SAS Super FREQ
Posts: 3,752

Re: Top 5 Values of Categorical Variable

Posted in reply to missmeliss22

See this blog post. The main trick is to use PROC FREQ with the ORDER=FREQ option. Then just take the first 5 observations  for  each school.

 

Here's an example to get  you started. HTH.

proc freq data=sashelp.cars ORDER=FREQ noprint;
tables origin*type / out=FreqOut;
run;
proc sort data=FreqOut; 
by Origin  descending Count; 
run;

data Top3;
set FreqOut;
by origin;
if first.origin then 
   cnt = 1;
else cnt + 1;
if cnt <= 3; /* top 3 */
run;

 

Occasional Contributor
Posts: 10

Re: Top 5 Values of Categorical Variable

[ Edited ]

I follow your logic which is very helpful, but then my proc tabulate does not give me the proper table I mocked up in my original question.  If anyone has any advice I would appreciate it.

 

In response to other questions, I want to find the top 5 majors over all universities, then the count of those top 5 majors for each university (undergrad only).

 

 

proc freq data = work.All ORDER = FREQ noprint;
tables Major / out = FreqOut (rename = (count = studentsinmaj));
run;
 
proc sort data = FreqOut;
by descending studentsinmaj;
run;
 
*top 5 overall;
data top5;
set FreqOut;
by descending studentsinmaj;
univNum = 1; *want this for overall summary of all universities;
if first.Major then
      Cnt = 1;
else Cnt + 1;
if Cnt <= 5;
run;
 
proc sort data = top5;
by univNum descending studentsinmaj;
run;
 
*filter school numbers by the top 5 overall languages;
proc freq data = work.All ORDER = FREQ noprint;
tables univNum * Major / out = test (rename = (count = studentsinmaj));
where Major in ('Top major #1 by count on top5', 'Top major #2 by count on top5', 'Top major #3 by count on top5', 'Top major #4 by count on top5');
run;
 
proc sort data = test;
by univNum descending studentsinmaj;
run;
 
 
data combined;
set top5 test;
by univNum descending studentsinmaj;
run;
 
ODS HTML BODY = 'C:\SAS\primmaj.xls'
style = STATDOC;
proc tabulate data = combined format = 6.1;
class univnum major / missing order = formatted;
var studentsinmaj;
tables (univnum = ''), N ((major = '' * studentsinmaj = '') * (N RowPctN = '%')) / Printmiss;
run;
ODS HTML CLOSE;

 

Super User
Posts: 19,772

Re: Top 5 Values of Categorical Variable

Posted in reply to missmeliss22

If you post some sample data perhaps we can help with the code. 

Occasional Contributor
Posts: 10

Re: Top 5 Values of Categorical Variable

StudentNum UnivNum Major
765473521 15 Art
534651321 15 Biology
984651316 15 Biology
534545645 15 Computer Science
145645012 15 Computer Science
645124645 15 Computer Science
878782245 15 Computer Science
456465678 15 Computer Science
347568651 15 Computer Science
789213425 15 Computer Science
243527654 15 Computer Science
156456321 15 History
645541651 15 History
215465654 15 History
126545465 15 History
899875654 15 History
554545111 15 History
545645655 15 Math
651321654 15 Math
985432135 15 Mechanical Engineering
654321354 15 Mechanical Engineering
687613521 15 Mechanical Engineering
765473521 90 Biology
534651321 90 Biology
984651316 90 Biology
534545645 90 Computer Science
145645012 90 Computer Science
645124645 90 Computer Science
878782245 90 Computer Science
456465678 90 Computer Science
347568651 90 Computer Science
789213425 90 History
243527654 90 History
156456321 90 History
645541651 90 History
215465654 90 History
126545465 90 History
899875654 90 History
554545111 90 History
545645655 90 Math
651321654 90 Math
985432135 90 Math
654321354 90 Sociology
687613521 90 Sociology
765473521 645 Chemistry
534651321 645 Chemistry
984651316 645 Chemistry
534545645 645 Computer Science
145645012 645 Computer Science
645124645 645 Computer Science
878782245 645 Computer Science
456465678 645 Electrical Engineering
347568651 645 Electrical Engineering
789213425 645 Electrical Engineering
243527654 645 Electrical Engineering
156456321 645 Electrical Engineering
645541651 645 Electrical Engineering
215465654 645 History
126545465 645 History
899875654 645 History
554545111 645 History
545645655 645 Math
651321654 645 Math
985432135 645 Mechanical Engineering
654321354 645 Mechanical Engineering
687613521 645 Mechanical Engineering
Super User
Posts: 19,772

Re: Top 5 Values of Categorical Variable

Posted in reply to missmeliss22

If I understand your problem correctly, I don't think you can use PROC TABULATE since you want to filter the results but still have the percentages reflect the percentages from the University. I think this calculates your data as desired, but doesn't prepare the final table. Hope it helps.

 

data have;
input StudentNum	UnivNum	Major $50. ;
cards;
765473521	15	Art
534651321	15	Biology
984651316	15	Biology
534545645	15	Computer Science
145645012	15	Computer Science
645124645	15	Computer Science
878782245	15	Computer Science
456465678	15	Computer Science
347568651	15	Computer Science
789213425	15	Computer Science
243527654	15	Computer Science
156456321	15	History
645541651	15	History
215465654	15	History
126545465	15	History
899875654	15	History
554545111	15	History
545645655	15	Math
651321654	15	Math
985432135	15	Mechanical Engineering
654321354	15	Mechanical Engineering
687613521	15	Mechanical Engineering
765473521	90	Biology
534651321	90	Biology
984651316	90	Biology
534545645	90	Computer Science
145645012	90	Computer Science
645124645	90	Computer Science
878782245	90	Computer Science
456465678	90	Computer Science
347568651	90	Computer Science
789213425	90	History
243527654	90	History
156456321	90	History
645541651	90	History
215465654	90	History
126545465	90	History
899875654	90	History
554545111	90	History
545645655	90	Math
651321654	90	Math
985432135	90	Math
654321354	90	Sociology
687613521	90	Sociology
765473521	645	Chemistry
534651321	645	Chemistry
984651316	645	Chemistry
534545645	645	Computer Science
145645012	645	Computer Science
645124645	645	Computer Science
878782245	645	Computer Science
456465678	645	Electrical Engineering
347568651	645	Electrical Engineering
789213425	645	Electrical Engineering
243527654	645	Electrical Engineering
156456321	645	Electrical Engineering
645541651	645	Electrical Engineering
215465654	645	History
126545465	645	History
899875654	645	History
554545111	645	History
545645655	645	Math
651321654	645	Math
985432135	645	Mechanical Engineering
654321354	645	Mechanical Engineering
687613521	645	Mechanical Engineering
;
run;

proc freq data=have order=freq noprint;
table major/out=summary_overall;
run;

data top5;
set summary_overall(obs=5);
run;

proc freq data=have noprint ;
table univnum*major/out=summary_uni (drop=pct_col) outpct ;
run;

proc sql;
create table want as
select *
from summary_uni
where major in (select major from top5);
quit;



Occasional Contributor
Posts: 10

Re: Top 5 Values of Categorical Variable

Thank you for this helpful code.  Any suggestions for getting the table I want, with the top 5 majors overall and then the counts and percents of those 5 for each school?  I am new to creating Excel tables in SAS.  I have been trying to learn Proc Tabulate.

Super User
Posts: 19,772

Re: Top 5 Values of Categorical Variable

Posted in reply to missmeliss22

The WANT table has this information, I'm assuming that all Majors are offered at all Universities and I'm not sure this is true. You may need to account for 0. 

 

To get the table you want, tranpsose the WANT table and use PROC PRINT on the results. You'll need to basically create the table on your own and then display it. 

 

PROC TABULATE is not suitable for this application, as the summary stats you want will not be calculated appropriately. The percents calculated are for all the data used but since you want a subset the values will not be correct. 

 

 

Occasional Contributor
Posts: 10

Re: Top 5 Values of Categorical Variable

The problem is I want univNums to go down the left side, and the top 5 languages across the top like this:

 

Top 5 majors for students in program                      
  Students in program Major #1 Major #2 Major #3 Major #4 Major #5
  N N % N % N % N % N %
Overall                      
School 1                      
School 2                      
School 3                      
School 4                      
School 5                      
School 6                      
.                      
.                      
.                      
Respected Advisor
Posts: 4,173

Re: Top 5 Values of Categorical Variable

Posted in reply to missmeliss22

Here something which gets close to your layout using Proc Tabulate.

Data used in the following code as generated by the code @Reeza posted.


proc sql noprint;
  select distinct cats("'",major,"'") into :major_list separated by ','
  from top5
  ;
quit;
 
options missing=' ';
proc tabulate 
    data=have(where=(major in (&major_list))) 
    format=F6.0 
    ;
  class  UnivNum Major;
  table
    all='Overall' UnivNum=' ',
    all='Students in program'*n*f=f16.0  Major=" "*(n*f=F6.0 pctn<Major>='%'*f=F6.2)
    /rts=15 box='Top 5 majors for students in program'
  ;
quit;
Super User
Posts: 19,772

Re: Top 5 Values of Categorical Variable

Posted in reply to missmeliss22

Proc report + a dataset should work. You may even be able to get it from proc report, but I'm not very good with those procs. 

 

Here's a paper that details this type of reporting. 

http://www2.sas.com/proceedings/forum2008/173-2008.pdf

Note the code is available for this paper see the last page for the link. 

 

 

Ask a Question
Discussion stats
  • 13 replies
  • 206 views
  • 0 likes
  • 5 in conversation