Weighted Descriptive Statistics


05-09-2018 02:55 PM

While working on weighted descriptive statistics, I started with PROC SURVEYFREQ because I want to analyze one variable within a subsample of the population: specifically, among those who are physically active, how many are White, Black, Hispanic, and Asian. Adding a WHERE statement produces a note stating: "The input data set is subset by WHERE, OBS, or FIRSTOBS. This provides a completely separate analysis of the subset. It does not provide a statistically valid subpopulation or domain analysis, where the total number of units in the subpopulation is not known with certainty. If you want a domain analysis, you should include the domain variables in the TABLES request."

I changed the code to PROC SURVEYMEANS and used a DOMAIN statement instead of the WHERE statement. The variables whose means I requested are ACTIVE_CAT, INSUFFICIENT_CAT, and INACTIVE_CAT, each coded 1 if the respondent falls in that category of the main outcome variable and 0 if not.

For example, the code for those who are White non-Hispanic is provided below:

PROC SURVEYMEANS DATA = FINAL;
    DOMAIN WHITE_MEPS;
    VAR ACTIVE_CAT INSUFFICIENT_CAT INACTIVE_CAT;
    WEIGHT WEIGHT; /* WEIGHT stands in for the survey weight variable; the statement takes an unquoted variable name */
RUN;

ACTIVE_CAT: 1 is active; 0 is either inactive or insufficient

INSUFFICIENT_CAT: 1 is insufficient; 0 is either active or inactive

INACTIVE_CAT: 1 is inactive; 0 is either active or insufficient

I ran the same code for DOMAIN Black non-Hispanic, Asian non-Hispanic, and Hispanic.

Output is below:

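| WHITE_NH | Variable | N | Mean | Std Error of Mean | 95% CL for Mean (lower) | 95% CL for Mean (upper) |
|---|---|---|---|---|---|---|
| 0 | ACTIVE_CAT | 31055 | 0.426972 | 0.003908 | 0.41931346 | 0.43463105 |
| | INSUFFICIENT_CAT | 31055 | 0.191087 | 0.003140 | 0.18493214 | 0.19724246 |
| | INACTIVE_CAT | 31055 | 0.381940 | 0.003847 | 0.37440091 | 0.38947999 |
| **1** | **ACTIVE_CAT** | **38705** | **0.435237** | **0.003391** | **0.42859149** | **0.44188345** |
| | **INSUFFICIENT_CAT** | **38705** | **0.1923** | | | |
| | **INACTIVE_CAT** | **38705** | **0.3724** | | | |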

Is it correct that the bolded text (White_nH = 1) would provide the proper distribution among the subsample?


Posted in reply to buder

05-10-2018 11:16 AM - edited 05-10-2018 11:47 AM

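If a value of 1 for your domain variable indicates membership in the category, then the mean of 0.435237 indicates that an estimated 43.52 percent of that domain is active (ACTIVE_CAT = 1), because the weighted mean of a 0/1 indicator is the weighted proportion of 1s.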
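The note quoted in the question suggests putting the domain variable in the TABLES request. A minimal sketch of that approach follows, assuming the data set and variable names from the original post (FINAL, WHITE_MEPS, ACTIVE_CAT) and using WEIGHT as a placeholder for the real survey weight variable:

```sas
/* Sketch only: FINAL, WHITE_MEPS, and ACTIVE_CAT are taken from the post;
   WEIGHT is a placeholder for the actual survey weight variable. */
proc surveyfreq data=final;
   /* ROW:    percent of each WHITE_MEPS level that falls in each ACTIVE_CAT level */
   /* COLUMN: percent of each ACTIVE_CAT level that falls in each WHITE_MEPS level */
   tables white_meps * active_cat / row column;
   weight weight;
run;
```

The row percentages for WHITE_MEPS = 1 should correspond to the domain mean from PROC SURVEYMEANS. The column percentages for ACTIVE_CAT = 1 address the question as first phrased: among those who are active, what share is White non-Hispanic.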

You may also need to consider whether you are providing all of the appropriate sample design information to the procedure. Is your sample stratified, possibly by geographic region? If so, you should have a STRATA statement.

If this data comes from the BRFSS, as seems possible from the category descriptions, you likely need a CLUSTER statement, because the household is the primary sampling unit and is a cluster (does "selected from the adults in the household" sound familiar?). If the data is BRFSS, you may have a variable _PSU for that purpose.

And for data points that may be missing because of skip patterns in the survey, you may want the NOMCAR option on the PROC statement (NOMCAR stands for "not missing completely at random").
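As a rough sketch of how the design statements fit together, the code below adds STRATA and CLUSTER to the original call. It assumes BRFSS-style design variables (_STSTR for the stratum and _PSU for the primary sampling unit). These names are assumptions that hold only if the data really are BRFSS. Other surveys use different names, so check the survey's own documentation. WEIGHT remains a placeholder for the real weight variable.

```sas
/* Sketch only: _STSTR and _PSU are BRFSS design variables and are assumed here;
   substitute the stratum and PSU variables documented for your survey.
   WEIGHT is a placeholder for the actual survey weight variable. */
proc surveymeans data=final mean stderr clm;
   strata _ststr;        /* sampling strata */
   cluster _psu;         /* primary sampling units */
   domain white_meps;    /* subpopulation analysis, not a WHERE subset */
   var active_cat insufficient_cat inactive_cat;
   weight weight;
run;
```

The design statements mainly change the standard errors and confidence limits. The weighted means themselves are driven by the WEIGHT statement.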
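A minimal sketch of the NOMCAR option, again using the names from the original post and WEIGHT as a placeholder weight variable:

```sas
/* Sketch only: NOMCAR applies to Taylor series variance estimation.
   WEIGHT is a placeholder for the actual survey weight variable. */
proc surveymeans data=final nomcar mean stderr clm;
   domain white_meps;
   var active_cat insufficient_cat inactive_cat;
   weight weight;
run;
```

With NOMCAR, the procedure computes variances by treating the nonmissing observations as a domain of the full sample rather than dropping the missing ones from the design, which usually matters when values are missing because of skip patterns.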