I have a dataset containing ID numbers, university number, and major for students in a certain program. I need to find the top 5 majors over all the schools and then the corresponding number and % for each of the top 5 majors (overall) by school number.
For example:
Top 5 majors for students in program | |||||||||||
Students in program | Major #1 | Major #2 | Major #3 | Major #4 | Major #5 | ||||||
N | N | % | N | % | N | % | N | % | N | % | |
Overall | |||||||||||
School 1 | |||||||||||
School 2 | |||||||||||
School 3 | |||||||||||
School 4 | |||||||||||
School 5 | |||||||||||
School 6 | |||||||||||
. | |||||||||||
. | |||||||||||
. |
I am having a hard time due to major being a categorical variable.
I have this code, but it doesn't give me what I am looking for:
proc freq data = have order = freq;
table univNum * Major / noprint out = majors;
run;
proc sort data = majors;
by univNum descending COUNT;
run;
proc means data = majors;
class univNum;
var COUNT;
output out = top5list sum =
idgroup(max(COUNT) out[5] (COUNT)=);
run;
Any advice would be much appreciated.
Isn't your logic missing a step?
You would first need to find the top 5 overall and then filter the percent at each school. It may not be the same...
1. Proc freq for overall - by subject only.
2. Proc freq by subject and school - for correct N and %
3. Filter table 2 by top 5 from step 1
To find the overall top5 you will also have to consider how to deal with ties.
Proc Rank gives you options for this so may be use this procedure to determine the overall top5 majors.
And to throw a potential complication is the Percent you are going to report the percent of all majors or the percent of the major among the top 5? I'm not sure if your "overall" means the denominator is total enrollment in the school or not. Careful definition of numerator and denominator in percents really helps.
And would that school enrollment include masters and above enrolled students? Since your top 5 are very likely to be bachelor's degree programs should the masters / doctoral levels be included?
See this blog post. The main trick is to use PROC FREQ with the ORDER=FREQ option. Then just take the first 5 observations for each school.
Here's an example to get you started. HTH.
proc freq data=sashelp.cars ORDER=FREQ noprint;
tables origin*type / out=FreqOut;
run;
proc sort data=FreqOut;
by Origin descending Count;
run;
data Top3;
set FreqOut;
by origin;
if first.origin then
cnt = 1;
else cnt + 1;
if cnt <= 3; /* top 3 */
run;
I follow your logic which is very helpful, but then my proc tabulate does not give me the proper table I mocked up in my original question. If anyone has any advice I would appreciate it.
In response to other questions, I want to find the top 5 majors over all universities, then the count of those top 5 majors for each university (undergrad only).
proc freq data = work.All ORDER = FREQ noprint; tables Major / out = FreqOut (rename = (count = studentsinmaj)); run; proc sort data = FreqOut; by descending studentsinmaj; run; *top 5 overall; data top5; set FreqOut; by descending studentsinmaj; univNum = 1; *want this for overall summary of all universities; if first.Major then Cnt = 1; else Cnt + 1; if Cnt <= 5; run; proc sort data = top5; by univNum descending studentsinmaj; run; *filter school numbers by the top 5 overall languages; proc freq data = work.All ORDER = FREQ noprint; tables univNum * Major / out = test (rename = (count = studentsinmaj)); where Major in ('Top major #1 by count on top5', 'Top major #2 by count on top5', 'Top major #3 by count on top5', 'Top major #4 by count on top5'); run; proc sort data = test; by univNum descending studentsinmaj; run; data combined; set top5 test; by univNum descending studentsinmaj; run; ODS HTML BODY = 'C:\SAS\primmaj.xls' style = STATDOC; proc tabulate data = combined format = 6.1; class univnum major / missing order = formatted; var studentsinmaj; tables (univnum = ''), N ((major = '' * studentsinmaj = '') * (N RowPctN = '%')) / Printmiss; run; ODS HTML CLOSE;
If you post some sample data perhaps we can help with the code.
StudentNum | UnivNum | Major |
765473521 | 15 | Art |
534651321 | 15 | Biology |
984651316 | 15 | Biology |
534545645 | 15 | Computer Science |
145645012 | 15 | Computer Science |
645124645 | 15 | Computer Science |
878782245 | 15 | Computer Science |
456465678 | 15 | Computer Science |
347568651 | 15 | Computer Science |
789213425 | 15 | Computer Science |
243527654 | 15 | Computer Science |
156456321 | 15 | History |
645541651 | 15 | History |
215465654 | 15 | History |
126545465 | 15 | History |
899875654 | 15 | History |
554545111 | 15 | History |
545645655 | 15 | Math |
651321654 | 15 | Math |
985432135 | 15 | Mechanical Engineering |
654321354 | 15 | Mechanical Engineering |
687613521 | 15 | Mechanical Engineering |
765473521 | 90 | Biology |
534651321 | 90 | Biology |
984651316 | 90 | Biology |
534545645 | 90 | Computer Science |
145645012 | 90 | Computer Science |
645124645 | 90 | Computer Science |
878782245 | 90 | Computer Science |
456465678 | 90 | Computer Science |
347568651 | 90 | Computer Science |
789213425 | 90 | History |
243527654 | 90 | History |
156456321 | 90 | History |
645541651 | 90 | History |
215465654 | 90 | History |
126545465 | 90 | History |
899875654 | 90 | History |
554545111 | 90 | History |
545645655 | 90 | Math |
651321654 | 90 | Math |
985432135 | 90 | Math |
654321354 | 90 | Sociology |
687613521 | 90 | Sociology |
765473521 | 645 | Chemistry |
534651321 | 645 | Chemistry |
984651316 | 645 | Chemistry |
534545645 | 645 | Computer Science |
145645012 | 645 | Computer Science |
645124645 | 645 | Computer Science |
878782245 | 645 | Computer Science |
456465678 | 645 | Electrical Engineering |
347568651 | 645 | Electrical Engineering |
789213425 | 645 | Electrical Engineering |
243527654 | 645 | Electrical Engineering |
156456321 | 645 | Electrical Engineering |
645541651 | 645 | Electrical Engineering |
215465654 | 645 | History |
126545465 | 645 | History |
899875654 | 645 | History |
554545111 | 645 | History |
545645655 | 645 | Math |
651321654 | 645 | Math |
985432135 | 645 | Mechanical Engineering |
654321354 | 645 | Mechanical Engineering |
687613521 | 645 | Mechanical Engineering |
If I understand your problem correctly, I don't think you can use PROC TABULATE since you want to filter the results but still have the percentages reflect the percentages from the University. I think this calculates your data as desired, but doesn't prepare the final table. Hope it helps.
data have;
input StudentNum UnivNum Major $50. ;
cards;
765473521 15 Art
534651321 15 Biology
984651316 15 Biology
534545645 15 Computer Science
145645012 15 Computer Science
645124645 15 Computer Science
878782245 15 Computer Science
456465678 15 Computer Science
347568651 15 Computer Science
789213425 15 Computer Science
243527654 15 Computer Science
156456321 15 History
645541651 15 History
215465654 15 History
126545465 15 History
899875654 15 History
554545111 15 History
545645655 15 Math
651321654 15 Math
985432135 15 Mechanical Engineering
654321354 15 Mechanical Engineering
687613521 15 Mechanical Engineering
765473521 90 Biology
534651321 90 Biology
984651316 90 Biology
534545645 90 Computer Science
145645012 90 Computer Science
645124645 90 Computer Science
878782245 90 Computer Science
456465678 90 Computer Science
347568651 90 Computer Science
789213425 90 History
243527654 90 History
156456321 90 History
645541651 90 History
215465654 90 History
126545465 90 History
899875654 90 History
554545111 90 History
545645655 90 Math
651321654 90 Math
985432135 90 Math
654321354 90 Sociology
687613521 90 Sociology
765473521 645 Chemistry
534651321 645 Chemistry
984651316 645 Chemistry
534545645 645 Computer Science
145645012 645 Computer Science
645124645 645 Computer Science
878782245 645 Computer Science
456465678 645 Electrical Engineering
347568651 645 Electrical Engineering
789213425 645 Electrical Engineering
243527654 645 Electrical Engineering
156456321 645 Electrical Engineering
645541651 645 Electrical Engineering
215465654 645 History
126545465 645 History
899875654 645 History
554545111 645 History
545645655 645 Math
651321654 645 Math
985432135 645 Mechanical Engineering
654321354 645 Mechanical Engineering
687613521 645 Mechanical Engineering
;
run;
proc freq data=have order=freq noprint;
table major/out=summary_overall;
run;
data top5;
set summary_overall(obs=5);
run;
proc freq data=have noprint ;
table univnum*major/out=summary_uni (drop=pct_col) outpct ;
run;
proc sql;
create table want as
select *
from summary_uni
where major in (select major from top5);
quit;
Thank you for this helpful code. Any suggestions for getting the table I want, with the top 5 majors overall and then the counts and percents of those 5 for each school? I am new to creating Excel tables in SAS. I have been trying to learn Proc Tabulate.
The WANT table has this information, I'm assuming that all Majors are offered at all Universities and I'm not sure this is true. You may need to account for 0.
To get the table you want, tranpsose the WANT table and use PROC PRINT on the results. You'll need to basically create the table on your own and then display it.
PROC TABULATE is not suitable for this application, as the summary stats you want will not be calculated appropriately. The percents calculated are for all the data used but since you want a subset the values will not be correct.
The problem is I want univNums to go down the left side, and the top 5 languages across the top like this:
Top 5 majors for students in program | |||||||||||
Students in program | Major #1 | Major #2 | Major #3 | Major #4 | Major #5 | ||||||
N | N | % | N | % | N | % | N | % | N | % | |
Overall | |||||||||||
School 1 | |||||||||||
School 2 | |||||||||||
School 3 | |||||||||||
School 4 | |||||||||||
School 5 | |||||||||||
School 6 | |||||||||||
. | |||||||||||
. | |||||||||||
. |
Here something which gets close to your layout using Proc Tabulate.
Data used in the following code as generated by the code @Reeza posted.
proc sql noprint;
select distinct cats("'",major,"'") into :major_list separated by ','
from top5
;
quit;
options missing=' ';
proc tabulate
data=have(where=(major in (&major_list)))
format=F6.0
;
class UnivNum Major;
table
all='Overall' UnivNum=' ',
all='Students in program'*n*f=f16.0 Major=" "*(n*f=F6.0 pctn<Major>='%'*f=F6.2)
/rts=15 box='Top 5 majors for students in program'
;
quit;
Proc report + a dataset should work. You may even be able to get it from proc report, but I'm not very good with those procs.
Here's a paper that details this type of reporting.
http://www2.sas.com/proceedings/forum2008/173-2008.pdf
Note the code is available for this paper see the last page for the link.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.