Contributor
Posts: 24

# Weighted Descriptive Statistics

In working on weighted descriptive statistics, I started with PROC SURVEYFREQ because I am interested in analyzing one variable within a subsample of the population: more specifically, among those who are physically active, how many are White, Black, Hispanic, and Asian. Adding a WHERE statement produces this note: "The input data set is subset by WHERE, OBS, or FIRSTOBS. This provides a completely separate analysis of the subset. It does not provide a statistically valid subpopulation or domain analysis, where the total number of units in the subpopulation is not known with certainty. If you want a domain analysis, you should include the domain variables in the TABLES request."
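For reference, the note's suggestion of putting the domain variable in the TABLES request might look roughly like the sketch below. It assumes a single categorical race/ethnicity variable (the hypothetical `race_cat`) and a survey weight variable (the hypothetical `weight_var`), neither of which is named in this thread; `active_cat` is the 0/1 indicator described further down.

```sas
/* Sketch only: race_cat and weight_var are hypothetical names. */
proc surveyfreq data=final;
   weight weight_var;
   /* Crossing the two variables keeps every observation in the analysis;  */
   /* ROW requests row percentages, i.e. the race distribution within each */
   /* level of active_cat (including active_cat = 1).                      */
   tables active_cat*race_cat / row;
run;
```

Unlike a WHERE statement, this does not drop observations before the procedure runs, which is the point the note is making.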

I changed the code to PROC SURVEYMEANS and used the DOMAIN statement instead of the WHERE statement. The variables whose means I requested were active_cat, insufficient_cat, and inactive_cat, each coded 1 if the respondent falls in that category of the main outcome variable and 0 if not.

For example, the code for those who are White non-Hispanic is provided below:

    PROC SURVEYMEANS DATA = FINAL;
    DOMAIN WHITE_MEPS;
    VAR ACTIVE_CAT INSUFFICIENT_CAT INACTIVE_CAT;
    WEIGHT "WEIGHT";  /* as posted; SAS expects an unquoted variable name here */
    RUN;

ACTIVE_CAT: 1 is active; 0 is either inactive or insufficient

INSUFFICIENT_CAT: 1 is insufficient; 0 is either active or inactive

INACTIVE_CAT: 1 is inactive; 0 is either active or insufficient

I ran the same code for DOMAIN Black non-Hispanic, Asian non-Hispanic, and Hispanic.

Output is below:

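| WHITE_NH | Variable | N | Mean | Std Error of Mean | 95% CL for Mean |
|---|---|---|---|---|---|
| 0 | ACTIVE_CAT | 31055 | 0.426972 | 0.003908 | 0.41931346, 0.43463105 |
| 0 | INSUFFICIENT_CAT | 31055 | 0.191087 | 0.003140 | 0.18493214, 0.19724246 |
| 0 | INACTIVE_CAT | 31055 | 0.381940 | 0.003847 | 0.37440091, 0.38947999 |
| **1** | **ACTIVE_CAT** | **38705** | **0.435237** | **0.003391** | **0.42859149, 0.44188345** |
| **1** | **INSUFFICIENT_CAT** | **38705** | **0.1923** | | |
| **1** | **INACTIVE_CAT** | **38705** | **0.3724** | | |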

Is it correct that the bolded text (White_nH = 1) would provide the proper distribution among the subsample?

Super User
Posts: 13,523

## Re: Weighted Descriptive Statistics

@buder wrote:

Is it correct that the bolded text (White_nH = 1) would provide the proper distribution among the subsample?

If a value of 1 for your domain variable indicates membership in the category, then the mean of 0.435237 indicates that an estimated 43.52 percent of that domain are active (ACTIVE_CAT = 1).
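To spell out why the mean of a 0/1 variable reads as a proportion: as I understand it, a weighted domain mean is a ratio of weighted totals. The notation below is my own shorthand, not SAS output: w_i is the survey weight, y_i is the 0/1 indicator (for example ACTIVE_CAT), and delta_i flags domain membership.

```latex
\hat{\bar{y}}_D
  = \frac{\sum_{i} w_i \, \delta_i \, y_i}{\sum_{i} w_i \, \delta_i},
\qquad
\delta_i =
\begin{cases}
  1 & \text{if unit } i \text{ is in the domain} \\
  0 & \text{otherwise}
\end{cases}
```

A quick sanity check, assuming each respondent with non-missing data falls in exactly one of the three categories: the three means within a domain should sum to about 1. For the domain coded 1, 0.435237 + 0.1923 + 0.3724 = 0.999937, and for the domain coded 0, 0.426972 + 0.191087 + 0.381940 = 0.999999, which is consistent with rounding in the posted values.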

You may need to consider whether you are providing all of the appropriate sample design information to the procedure, though. Is your sample stratified, possibly by geographic region? Then you should have a STRATA statement.

If this data comes from the BRFSS, as seems possible from the category descriptions, you likely need a CLUSTER statement, because the household is the primary sampling unit and is a cluster (does "selected from adults in the household" sound familiar?). If the data is BRFSS, you may have a variable _psu for that purpose.
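A rough sketch of what the design statements could look like, assuming the data set does carry a stratum variable and a PSU variable. Here `stratum_var` and `weight_var` are hypothetical placeholders, and `_psu` is only appropriate if the data really is BRFSS; substitute whatever your survey's documentation specifies.

```sas
/* Sketch only: stratum_var and weight_var are hypothetical names, */
/* and _psu assumes a BRFSS-style PSU variable is present.         */
proc surveymeans data=final mean stderr clm;
   strata  stratum_var;   /* sampling strata, if the design is stratified */
   cluster _psu;          /* primary sampling unit                        */
   weight  weight_var;    /* survey weight                                */
   domain  white_meps;    /* 0/1 domain indicator from the posted code    */
   var     active_cat insufficient_cat inactive_cat;
run;
```

The point estimates should match a weight-only run; what the STRATA and CLUSTER statements change is the standard errors and confidence limits.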

And for data points that may be missing due to skip patterns in the survey, you may want the NOMCAR option (not missing completely at random) on the PROC statement.
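As a minimal illustration, NOMCAR goes on the PROC statement itself. The weight variable name `weight_var` is a hypothetical placeholder, and the NMISS keyword is added here only so the output shows how many values are missing for each variable.

```sas
/* Sketch only: weight_var is a hypothetical name. */
proc surveymeans data=final nomcar nmiss mean stderr clm;
   weight weight_var;
   domain white_meps;
   var    active_cat insufficient_cat inactive_cat;
run;
```

Any STRATA and CLUSTER statements your design calls for would be included in the same step.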
