- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a demographic data set with no missing data that covers a 12 month period. It is built like the below:
data a;
input date :monyy. age sex race count;
format date monyy.;
datalines;
JAN18 24 0 0 1
JAN18 22 1 1 3
FEB18 19 1 2 1
MAR18 20 0 0 4
MAR18 20 1 4 2
;
RUN;
The categorical variables are numerical with values representing different levels (e.g., sex=0 is male, sex=1 is female; race=0 is white, =1 is black, =2 is asian, etc.). Count means how many individuals with that combination of preceding variables are in that month sample. There are no individual identifiers in the sample.
The total number of individuals and distribution of levels of variables changes slightly each month, and I'm wondering how best to find the one month average for each variable and value due to some other monthly analyses.
In other words, I want to average the 12 month period demographics to one month for a "table 1" since I don't have individual information of how they may come and go over the study period (i.e., if there are 12000 people in the entire study, the one-month average n=1000; if there are 9000 white people over the 12 month period then white people would represent 750 of the one month sample, etc.). I would want this to round to the nearest whole person.
Any ideas on how best to code? I appreciate the help.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It looks like you just want "sum of subject/number of month" for each level of each variabl
data a; input date :monyy. age sex race count; format date monyy.; datalines; JAN18 24 0 0 1 JAN18 22 1 1 3 FEB18 19 1 2 1 MAR18 20 0 0 4 MAR18 20 1 4 2 ; RUN; data temp; set a; do i=1 to count; output; end; drop i count; run; proc sql; create table want as select 'age' as var,age as levels,int(count(*)/count(distinct date)) as average from temp group by age union all select 'sex' as var,sex,int(count(*)/count(distinct date)) from temp group by sex union all select 'race' as var,race,int(count(*)/count(distinct date)) from temp group by race union all select 'Total' as var,.,int(count(*)/count(distinct date)) from temp ; quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Honestly, a larger data set (covering a whole year) even if it is fake data, and showing us the desired output would help us understand the problem better.
It sounds like you want some sort of sum or average or weighted sum or weighted average over the entire year. PROC SUMMARY will do this. It can also easily slice the data into different results for combinations of age sex and race.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It looks like you just want "sum of subject/number of month" for each level of each variabl
data a; input date :monyy. age sex race count; format date monyy.; datalines; JAN18 24 0 0 1 JAN18 22 1 1 3 FEB18 19 1 2 1 MAR18 20 0 0 4 MAR18 20 1 4 2 ; RUN; data temp; set a; do i=1 to count; output; end; drop i count; run; proc sql; create table want as select 'age' as var,age as levels,int(count(*)/count(distinct date)) as average from temp group by age union all select 'sex' as var,sex,int(count(*)/count(distinct date)) from temp group by sex union all select 'race' as var,race,int(count(*)/count(distinct date)) from temp group by race union all select 'Total' as var,.,int(count(*)/count(distinct date)) from temp ; quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This looks like exactly what I am looking for, thanks! I will add the remaining variables and I believe this will get me everything I need. I appreciate your help.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I don't think @Ksharp has provided the best answer. First, PROC SUMMARY will be a lot less typing and probably execute faster. In the case where your real data set has more than the three variables (age sex race), you will have to do a lot more typing in SQL than you would have to do in PROC SUMMARY.
If there is a need to slice the data by two (or more) variables at one time, such as all combinations of age and sex, PROC SUMMARY handles this easily, with little extra coding. The extra coding needed for an SQL solution is not obvious.
Furthermore, if the real data has missing values in variable COUNT, SQL gives the wrong answers, while SUMMARY gives the right answers.
PROC SUMMARY is a much better tool here.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm not sure I understand exactly what you want. I guess you want the average of count for each level of RACE, of AGE, and of SEX. If so, then this would generate a report of those averages:
data a;
input date :monyy. age sex race count;
format date monyy.;
datalines;
JAN18 24 0 0 1
JAN18 22 1 1 3
FEB18 19 1 2 1
MAR18 20 0 0 4
MAR18 20 1 4 2
RUN;
proc means data=a;
class age sex race ;
var count;
types age sex race;
run;
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
So you are starting with the counts per combination. Basically your starting point is the output from this TABLES statement of PROC FREQ against your individual level data.
tables date*age*sex*race / noprint out=A ;
But what is it that want to get from this? I notice that some of your possible values for RACE etc do not actually appear in the data. To get summaries for those levels you might want to use the completetypes option of PROC MEANS/SUMMARY.
data a;
input date :monyy. age sex race count;
format date monyy7.;
datalines;
JAN18 24 0 0 1
JAN18 22 1 1 3
FEB18 19 1 2 1
MAR18 20 0 0 4
MAR18 20 1 4 2
;
proc format ;
value sex 0='male' 1='female';
value race 0='white' 1='black' 2='asian' 3='other race1' 4='other race2';
value age 19='19' 20='20' 21='21' 22='22' 23='23' 24='24';
value months '01JAN2018'd='2018-01' '01FEB2018'd='2018-02' '01MAR2018'd='2018-03' ;
run;
proc means mean min max completetypes;
class date age sex race / preloadfmt ;
ways 1;
var count;
format date months. age age. sex sex. race race. ;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I had a question that is a little different from the topic I asked, so I wanted to message you directly. Is there a way in proc sql to add a column that provides the relative frequency of each "average" by level of the variable? For example, if "sex" had two levels and my "0" average was 4000 and my "1" average was 1000, I would want to see .8 and .2, respectively. Thank you.
data a;
input date :monyy. age sex race count;
format date monyy.;
datalines;
JAN18 24 0 0 1
JAN18 22 1 1 3
FEB18 19 1 2 1
MAR18 20 0 0 4
MAR18 20 1 4 2
;
RUN;
data temp;
set a;
do i=1 to count;
output;
end;
drop i count;
run;
proc sql;
create table want as
select 'age' as var,age as levels,int(count(*)/count(distinct date)) as average,
calculated average/
(select sum(average) from (select int(count(*)/count(distinct date)) as average from temp group by age)) as rate
from temp
group by age
;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data a;
input date :monyy. age sex race count;
format date monyy.;
datalines;
JAN18 24 0 0 1
JAN18 22 1 1 3
FEB18 19 1 2 1
MAR18 20 0 0 4
MAR18 20 1 4 2
;
RUN;
data temp;
set a;
do i=1 to count;
output;
end;
drop i count;
run;
proc sql;
create table want as
select 'age' as var,age as levels,int(count(*)/count(distinct date)) as average,
calculated average/
(select sum(average) from (select int(count(*)/count(distinct date)) as average from temp group by age)) as rate
from temp
group by age
union all
select 'sex' as var,sex,int(count(*)/count(distinct date)) as average,
calculated average/
(select sum(average) from (select int(count(*)/count(distinct date)) as average from temp group by sex)) as rate
from temp
group by sex
union all
select 'race' as var,race,int(count(*)/count(distinct date)) as average,
calculated average/
(select sum(average) from (select int(count(*)/count(distinct date)) as average from temp group by race)) as rate
from temp
group by race
;
quit;