Hi,
I have a large data set of 17,033 subjects originating of a survey of schools. The data shown below is part of the data set, and it includes the subject id (ID), province code (PROVINCE), school code (SCHOOL), the gender of the student (GENDER), and the total number of schools in each province (TOT_SCHOOL). In order to calculate the sampling probabilities I need to add to this data set 2 new variables: number of schools sampled in each province (N_SCHOOL), and proportion of subjects of each gender within each province (PROP_GENDER). The number of schools sampled in each province varies. Below I include the HAVE data set, and the WANT data set. The values of N_SCHOOL and PROP_GENDER are from Proc Freq (tables shown below) and entered manually.
I greatly appreciate any programming help (proc sql or otherwise) so that I can apply the solution to this data and future surveys.
Data HAVE
Obs ID province School Gender tot_school
1 11189 Aga ElHli 2 53
2 11190 Aga ElHli 2 53
3 11191 Aga ElHli 2 53
4 11192 Aga ElHli 2 53
5 11193 Aga ElHli 2 53
6 11194 Aga ElHli 2 53
7 11195 Aga ElHli 2 53
8 11196 Aga ElHli 2 53
9 11197 Aga ElHli 2 53
10 11198 Aga ElHli 2 53
.
.
17031 3844 kal Tast 1 46
17032 3845 kal Tast 2 46
17033 3846 kal Tast 1 46
17034 3847 kal Tast 2 46
Data WANT
Obs ID province School Gender tot_school n_school prop_gender
1 11189 Aga ElHli 2 53 3 41.91
2 11190 Aga ElHli 2 53 3 41.91
3 11191 Aga ElHli 2 53 3 41.91
4 11192 Aga ElHli 2 53 3 41.91
5 11193 Aga ElHli 2 53 3 41.91
6 11194 Aga ElHli 2 53 3 41.91
7 11195 Aga ElHli 2 53 3 41.91
8 11196 Aga ElHli 2 53 3 41.91
9 11197 Aga ElHli 2 53 3 41.91
10 11198 Aga ElHli 2 53 3 41.91
.
.
17031 3844 kal Tast 1 46 3 49.89
17032 3845 kal Tast 2 46 3 50.11
17033 3846 kal Tast 1 46 3 50.11
17034 3847 kal Tast 2 46 3 49.89
proc freq;
tables province * gender/nofreq nopercent nocol;
run;
Table of province by Gender
Province Gender
Row Pct ‚ 1‚ 2‚ Total
---------------------------------------------------------
Aga ‚ 58.09 ‚ 41.91 ‚
Ben ‚ 50.50 ‚ 49.50 ‚
Ber ‚ 48.47 ‚ 51.53 ‚
.
.
kal ‚ 49.89 ‚ 50.11 ‚
-------------------------------------------------------
Total 8130 8903 17033
proc freq;
tables province * school /list nofreq nocol;
run;
The FREQ Procedure
Cumulative Cumulative
province School Frequency Percent Frequency Percent
------------------------------------------------------------------------------------------
Aga ElHli 200 1.17 200 1.17
Aga Lycla 498 2.92 698 4.10
Aga Nab 242 1.42 940 5.52
Ben Ibndoun 374 2.20 1314 7.71
Ben Siddim 89 0.52 1403 8.24
Ben ZIA 439 2.58 1842 10.81
Ber Elfa 229 1.34 2071 12.16
Ber Elq 238 1.40 2309 13.56
Ber Ibn 220 1.29 2529 14.85
.
.
kal Abo 119 0.70 16710 98.10
kal Kha 224 1.32 16934 99.41
kal Tast 100 0.59 17034 100.00
Ill give an example of your latter statistic, prop_gender. Since you are familiar with PROG FREQ, if you want your numbers that the procedure outputs, you may use these options:
proc freq data=SASHELP.CLASS;
tables age * sex/ noprint OUTPCT
out=result(where=(sex="M")
drop=count PERCENT PCT_COL
) ;
run;
And this data set you can merge in with your original data.
Age | Sex | PCT_ROW |
11 | M | 50 |
12 | M | 60 |
13 | M | 33.33333 |
14 | M | 50 |
15 | M | 50 |
16 | M | 100 |
Ill give an example of your latter statistic, prop_gender. Since you are familiar with PROG FREQ, if you want your numbers that the procedure outputs, you may use these options:
proc freq data=SASHELP.CLASS;
tables age * sex/ noprint OUTPCT
out=result(where=(sex="M")
drop=count PERCENT PCT_COL
) ;
run;
And this data set you can merge in with your original data.
Age | Sex | PCT_ROW |
11 | M | 50 |
12 | M | 60 |
13 | M | 33.33333 |
14 | M | 50 |
15 | M | 50 |
16 | M | 100 |
Thank you very much.
What is the equivalent of this in Proc SQL?
For you N_School stat, which is the cardinality of School within each Province, I'll use SASHELP.SHOES :
PROC SQL;
SELECT Region,
(COUNT(DISTINCT(Subsidiary))) AS COUNT_DISTINCT_of_Subsidiary
FROM SASHELP.SHOES
GROUP BY Region;
QUIT;
And you can check the dataset to verify these numbers are true counts:
Region | COUNT_DISTINCT_of_Subsidiary |
Africa | 8 |
Asia | 3 |
Canada | 5 |
Central America/Caribbean | 4 |
Eastern Europe | 4 |
Middle East | 3 |
Pacific | 6 |
South America | 7 |
United States | 5 |
Western Europe | 8 |
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.