BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
AVUH777
Obsidian | Level 7

Hello,

 

I'm sorry if the question is phrased weird but I am trying to create a new dataset using line data by grouping by two variables. 

 

An example of my current dataset:

 

data WORK.STUDENTS;
infile datalines delimiter=',';
input studentid schoolname $ grade gender $ poverty
;
datalines;
1,WMS,6,F,Y
2,WMS,5,F,N
3,WMS,5,M,Y
4,EH,11,M,N
5,EH,12,F,N
6,EH,12,M,N
7,EH,12,F,N
8,SE,3,M,N
9,SE,3,F,N
10,SE,3,M,Y
;
run;

 Instead of line counts, I would like the final dataset to include school headcounts for each grade, broken down by gender and poverty status. How can I do this?

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Can students be duplicated in the data set? 

 

If no duplicates

proc freq data=have;
table schoolname*grade*gender*poverty / list;
run;

 

If duplicates, SQL is easier IMO.

 

proc sql;
create table want as
select schoolname, grade, gender, poverty, count(distinct studentID) as num_student
from have
group by 1, 2, 3, 4;
/*1-4 is faster than typing out the first four column names*/
quit; 

@AVUH777 wrote:

Hello,

 

I'm sorry if the question is phrased weird but I am trying to create a new dataset using line data by grouping by two variables. 

 

An example of my current dataset:

 

data WORK.STUDENTS;
infile datalines delimiter=',';
input studentid schoolname $ grade gender $ poverty
;
datalines;
1,WMS,6,F,Y
2,WMS,5,F,N
3,WMS,5,M,Y
4,EH,11,M,N
5,EH,12,F,N
6,EH,12,M,N
7,EH,12,F,N
8,SE,3,M,N
9,SE,3,F,N
10,SE,3,M,Y
;
run;

 Instead of line counts, I would like the final dataset to include school headcounts for each grade, broken down by gender and poverty status. How can I do this?


 

View solution in original post

6 REPLIES 6
Quentin
Super User

Please show the output data you want from this sample data, and also show the code you have tried.

PaigeMiller
Diamond | Level 26

For most counting problems, use PROC FREQ.

 

proc freq data=students;
    tables schoolname*grade*gender*poverty/list;
run;

 

 

By the way, your data step code doesn't work right, you really ought to make sure it works properly before you present it to us. Do yourself a favor and do us a favor by testing your code. Testing your code is a GOOD thing!

--
Paige Miller
AVUH777
Obsidian | Level 7
Thanks a bunch! I will make sure to do that next time.
Reeza
Super User

Can students be duplicated in the data set? 

 

If no duplicates

proc freq data=have;
table schoolname*grade*gender*poverty / list;
run;

 

If duplicates, SQL is easier IMO.

 

proc sql;
create table want as
select schoolname, grade, gender, poverty, count(distinct studentID) as num_student
from have
group by 1, 2, 3, 4;
/*1-4 is faster than typing out the first four column names*/
quit; 

@AVUH777 wrote:

Hello,

 

I'm sorry if the question is phrased weird but I am trying to create a new dataset using line data by grouping by two variables. 

 

An example of my current dataset:

 

data WORK.STUDENTS;
infile datalines delimiter=',';
input studentid schoolname $ grade gender $ poverty
;
datalines;
1,WMS,6,F,Y
2,WMS,5,F,N
3,WMS,5,M,Y
4,EH,11,M,N
5,EH,12,F,N
6,EH,12,M,N
7,EH,12,F,N
8,SE,3,M,N
9,SE,3,F,N
10,SE,3,M,Y
;
run;

 Instead of line counts, I would like the final dataset to include school headcounts for each grade, broken down by gender and poverty status. How can I do this?


 

AVUH777
Obsidian | Level 7
This was SUPER helpful! Thanks a ton!
ballardw
Super User

This might be overkill but something worth learning:

 

proc summary data=work.students;
  class schoolname grade gender poverty;
  output out=want ;
run;

The output data set will have a variable named _type_ that you can select for other purposes with _freq_ indicating the count for that row in the resulting dataset.

If you look you will see that _type_ indicates combinations of variables, 0 is the whole set overall so gets a count of 10,

1 is count of each poverty level only, 2 is each gender level only, 3 is combinations of gender and poverty, 4 is count of grade only, 5 is combinations of grade and poverty, 6 is combinations of grade and gender, 7 is combinations of grade gender and poverty,  ...

All the combinations possible of all 4 of your variables plus the "all".

 

Options can even get 0 _freq_ for combinations that are not present but have levels in the data.

 

This result data set is larger than the input because 4 variables leads to 16 levels of _typ_, 0 to 15. Astute readers may recognize 16 as 2 to the 4th power...

 

So to print just certain combinations you could select the _type_ such as:

proc print data=want noobs label;
   where _type_ in (8 9 10 11 12 13 14 15);
   var schoolname grade gender poverty _type_;
   label _freq_ = "# students";
run;

FWIW I had one project for many years reporting on school data that involved not just school name grade but school district, county, school admin region, school type (private or public) and immunization status for seven different antigens.

Reporting involved such things as statewide, statewide by school type, statewide by grade, statewide by school type and grade, then school admin region for all of the above, then county and all of the above, then school district by ..., and school.

 

This type of summary was critical as I could generate one data set that had all of the reporting information and select as need for the different documents/ pages of reports. The total project generated about 3,400 pages of reports.

 

 

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 2979 views
  • 0 likes
  • 5 in conversation