issac
Fluorite | Level 6

Hi folks;

I have been working on a project that aims to predict physician workload (measured in Primary Care Relative Value Units, PCRVU) from some patient demographic attributes and, more importantly, illness types. One problem, or what I thought to be a problem, is that each row (patient) has a set of 1s and 0s across the disease columns (1 = the patient had the illness during the past year and was seen for it), but only one cell for PCRVU, which accounts for all physician time spent on all visits over the past year. So we cannot say how much of the PCRVU is related to, say, diabetes, and how much to, say, vascular problems, and so on. We think the job can be done in two stages: first, split the PCRVU among the disease types; second, analyze the divided PCRVUs with multivariate GLIMMIX. We chose GLIMMIX because there is no guarantee that PCRVU is approximately normally distributed, and because we have to include random effects such as administrative region (VISN) in order to account for potential heteroscedasticity and possible correlations within clusters of patients. I am asking whether there is a way to address this problem in SAS 9.3. Any helpful comments would be highly appreciated.


P.S. A sample Excel file is attached. ACC001 through ACC030 are the disease types.


Thanks!

Issac

1 ACCEPTED SOLUTION

Accepted Solutions
oloolo
Fluorite | Level 6

It is doable, but you need to transpose your data into multiple rows per original observation and add an ID variable for each group. Each row carries the same predictors but a different response value, taken from what were originally different response variables. You should also add a variable that indicates the distribution for each response and pass it to the MODEL statement with DIST=BYOBS(variable).
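A minimal sketch of this stacking approach might look like the following. It is only an illustration under stated assumptions: the data set name `wide`, the variables `patient_id`, `visn`, and `age`, and the split responses `pcrvu_acc001-pcrvu_acc030` are hypothetical names that do not come from the thread, and the distribution and random-effects structure are placeholders, not a recommendation.

```sas
/* Assumed input WORK.WIDE: one row per patient with patient_id, visn, age,
   and hypothetical split responses pcrvu_acc001-pcrvu_acc030. */
data long;
   set wide;
   length dist $8;
   array y{30} pcrvu_acc001-pcrvu_acc030;
   do resp_id = 1 to 30;          /* one output row per original response */
      response = y{resp_id};
      dist = 'Gamma';             /* placeholder: set the distribution per response;
                                     gamma requires strictly positive values */
      output;
   end;
   drop pcrvu_acc001-pcrvu_acc030;
run;

proc glimmix data=long;
   class patient_id visn resp_id;
   /* separate intercept and age slope for each stacked response */
   model response = resp_id resp_id*age / noint solution dist=byobs(dist);
   random intercept / subject=visn;              /* region-level clustering */
   random intercept / subject=patient_id(visn);  /* ties a patient's responses together */
run;
```

Here `patient_id` plays the role of the ID variable for each group, and `dist` is the per-row distribution indicator read by DIST=BYOBS. With 30 responses and many patients the model may be slow to fit, so it may be worth trying it on a few responses first.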


5 REPLIES
1zmm
Quartz | Level 8

You might first consider some simple ways to assess the different disease types and to reduce their number to something more manageable:

  1. Calculate the number of patients who have a specific disease type (for example, with PROC FREQ). It may be worthwhile to concentrate on only the more common disease types for further analyses.

  2. Try to cluster the different disease types. For example, even though PROC VARCLUS is designed for interval/ratio continuous variables, it may yield useful disease clusters for your dichotomous disease type data. You might then concentrate on only the more common clusters of disease types for further analysis. An alternative procedure might be PROC SUMMARY, though 30 dichotomous disease types (implying 2**30 possible combinations with the NWAY option) may choke it.

Then, with the reduced number of disease types or disease clusters, you would use PROC GLIMMIX to analyze your data further.
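As a rough illustration of the two screening steps above, something like the following could be a starting point. The data set name `patients` is an assumption, and `maxclusters=8` is an arbitrary value chosen only for the example.

```sas
/* Step 1: frequency of each 0/1 disease indicator
   (assumed data set WORK.PATIENTS with ACC001-ACC030). */
proc freq data=patients;
   tables ACC001-ACC030 / nocum;
run;

/* Step 2: exploratory clustering of the indicators.
   maxclusters=8 is an arbitrary cap for illustration only. */
proc varclus data=patients maxclusters=8 short;
   var ACC001-ACC030;
run;
```

The PROC FREQ output shows how many patients have each disease type, and the VARCLUS cluster summary could then guide which indicators to combine or drop before fitting the GLIMMIX model.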

issac
Fluorite | Level 6

@1zmm

Thanks for the response. I should clarify two points here. First, for multiple disease types we have only one value of the response. Equivalently, if we construct a single column named disease type with 30 classes, then each row has one value for the response but more than one value for disease type. Dealing with this requires a way to identify the percentage each disease contributes to the response. The problem arises because the PCRVU (response) was pulled from a different data warehouse than the disease types. Second, once the first issue is addressed, we end up with a set of 30 PCRVUs for each row, each describing the portion of the initial PCRVU accounted for by a specific disease. We can then treat this as a multivariate problem with 30 responses (possibly correlated with each other), but now without the disease columns.

This viewpoint is why I am asking about multivariate GLIMMIX rather than its univariate version.

1zmm
Quartz | Level 8

The current version of PROC GLIMMIX (9.3) allows only univariate responses (for example, PCRVU). As far as I know, only PROC GLM allows multivariate responses.

If what you want is somewhat equivalent to the R-squared statistic in multiple linear regression to determine "the percentage of the variability in the dependent variable, PCRVU, 'explained' by the independent variables in the regression model", then you might Google the phrase "variable importance" to look at statistical studies that have tried to tease out the contribution of model variables in "explaining" the variability of the dependent variable. Quasi-R-squared statistics and variable importance statistics are also available for logistic regression and proportional hazards regression.

A crude, not fully acceptable, method for doing this for your application might be the following:

  1. Run a regression model with all the disease variables and the other potentially confounding variables to determine the maximum R-squared statistic "explaining" the variability in the dependent variable for this model.

  2. Run a regression model with only the other potentially confounding variables but none of the disease variables to determine the minimum R-squared statistic "explaining" the variability in the dependent variable by these other variables for this model.

  3. Run 30 other regression models, each with the potentially confounding variables and all but one of the 30 disease variables, to determine the R-squared statistics "explaining" the variability in the dependent variable by the potentially confounding variables and 29 of the 30 disease variables.

  4. Calculate the difference between the maximum R-squared statistic from model #1 and each of the 30 R-squared statistics from the 30 model #3's.

  5. Calculate the difference between the maximum R-squared statistic from model #1 and the minimum R-squared statistic from model #2.

  6. Use either the calculated difference from #4 above or that difference divided by the difference from #5 above as your measure of the "contribution" of a specific disease type to the variation in your dependent variable.

A problem with this method is that the sum of these contributions for all 30 disease types may not equal, and may even exceed, the difference from #5 above.
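The six steps above could be sketched as a small macro. This is only one possible reading: it uses ordinary least squares via PROC REG (an assumption, since the post does not name a procedure), the names `patients`, `pcrvu`, and `age` are hypothetical, and the confounders must be numeric because PROC REG has no CLASS statement.

```sas
%macro r2_contrib(data=patients, y=pcrvu, conf=age, n=30);
   %local i j all keep;

   /* Build the full list of disease indicators ACC001..ACCnnn */
   %let all=;
   %do j=1 %to &n;
      %let all=&all ACC%sysfunc(putn(&j, z3.));
   %end;

   /* Remove results left over from an earlier run, if any */
   proc datasets lib=work nolist; delete est_drop; quit;

   /* Step 1: confounders + all diseases.  Step 2: confounders only. */
   proc reg data=&data outest=est_full rsquare noprint;
      model &y = &conf &all;
   run; quit;
   proc reg data=&data outest=est_conf rsquare noprint;
      model &y = &conf;
   run; quit;

   /* Step 3: refit, leaving out one disease indicator at a time */
   %do i=1 %to &n;
      %let keep=;
      %do j=1 %to &n;
         %if &j ne &i %then %let keep=&keep ACC%sysfunc(putn(&j, z3.));
      %end;
      proc reg data=&data outest=est_i rsquare noprint;
         model &y = &conf &keep;
      run; quit;
      data est_i;
         set est_i(keep=_RSQ_);
         length dropped $8;
         dropped = "ACC%sysfunc(putn(&i, z3.))";
      run;
      proc append base=est_drop data=est_i force; run;
   %end;

   /* Steps 4-6: differences in R-squared and their share of the total gain */
   data contrib;
      if _n_ = 1 then do;
         set est_full(keep=_RSQ_ rename=(_RSQ_=r2_full));
         set est_conf(keep=_RSQ_ rename=(_RSQ_=r2_conf));
      end;
      set est_drop;
      diff = r2_full - _RSQ_;                 /* step 4 */
      total_gain = r2_full - r2_conf;         /* step 5 */
      if total_gain > 0 then share = diff / total_gain;   /* step 6 */
   run;
%mend r2_contrib;

%r2_contrib(data=patients, y=pcrvu, conf=age, n=30);
```

The resulting `contrib` data set would hold one row per dropped disease indicator, with `diff` and `share` as the two candidate contribution measures. As noted above, the shares need not sum to one.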

By the way, the suggestion in my previous reply was intended only to reduce the number of disease types or clusters of disease types that you would have to study.

issac
Fluorite | Level 6

@oloolo Thanks for your reply. It's exactly what I was looking for.
