Hi folks,
I have been working on a project that aims to predict physician workload (in terms of Primary Care Relative Value Units, PCRVU) from some patient demographic attributes and, more importantly, illness types. One problem, or what I thought to be a problem, is that each row (patient) has a set of 1s and 0s across the disease columns (1 = the patient had the illness during the past year and was seen for it), but there is only one PCRVU cell, which accounts for all the physician time spent on all visits over the past year. So we cannot say how much of the PCRVU is related to, say, diabetes and how much to, say, a vascular problem, and so on. We think the job can be done in two stages: first, split the PCRVU among the disease types; second, analyze the divided PCRVUs with a multivariate GLIMMIX. We chose GLIMMIX because there is no guarantee that the PCRVU is approximately normally distributed, and because we have to include random effects such as administrative region (VISN) in order to account for potential heteroscedasticity and possible correlations within clusters of patients. So I am asking whether there is a way to address this problem in SAS 9.3. Any helpful comments would be highly appreciated.
P.S. A sample Excel file is attached. ACC001 through ACC030 are the disease types.
Thanks!
Issac
It is doable, but you need to transpose your data into multiple rows per original observation and add an ID variable for each group. Each row has the same predictors but a different response value, taken from what were originally different response variables. You should also add a variable indicating the distribution of each response and pass it to the MODEL statement with DIST=BYOBS.
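A minimal sketch of that layout, assuming the PCRVU has already been split into one column per disease. The names wide, long, patient_id, age, and PCRVU001 through PCRVU030 are hypothetical and do not come from the attached file. The gamma distribution is only one possible choice and requires strictly positive responses.

```sas
/* Stack the 30 hypothetical per-disease responses into one column */
data long;
   set wide;                               /* assumed: one row per patient */
   array rvu{30} PCRVU001-PCRVU030;        /* assumed split responses */
   length dist $8;
   do resp = 1 to 30;
      y    = rvu{resp};                    /* response for this disease */
      dist = 'Gamma';                      /* distribution name read by BYOBS */
      output;
   end;
   keep patient_id VISN age resp y dist;
run;

proc glimmix data=long;
   class patient_id VISN resp;
   /* resp and resp*age give each response its own intercept and age slope */
   model y = resp resp*age / noint solution dist=byobs(dist);
   random intercept / subject=VISN;        /* clustering within region */
   random intercept / subject=patient_id;  /* ties together a patient's 30 rows */
run;
```

If the responses do not all share one distribution, the dist variable can take a different value per response (for example 'Normal' for some rows), which is the point of DIST=BYOBS.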
You might first consider some simple ways to assess the different disease types and to reduce their number to something more manageable:
1. Calculate the number of patients who have a specific disease type (for example, PROC FREQ). It may be worthwhile concentrating on only the more common disease types for further analyses.
2. Try to cluster the different disease types. For example, even though PROC VARCLUS is designed for interval/ratio continuous variables, it may yield useful disease clusters for your dichotomous disease type data. You might then concentrate on only the more common clusters of disease types for further analysis.
An alternative procedure might be PROC SUMMARY, though 30 dichotomous disease types (implying 2**30 possible combinations with the NWAY option) may choke it.
Then, with the reduced number of disease types or disease clusters, you would then use PROC GLIMMIX to analyze your data further.
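The steps above could be sketched roughly as follows. The data set name patients, the covariate age, the choice of 8 clusters, and the three disease flags kept in the final model are all placeholders for illustration, not recommendations.

```sas
/* 1. How many patients have each disease type */
proc freq data=patients;
   tables ACC001-ACC030;
run;

/* 2. Group the 30 disease flags into a smaller number of clusters */
proc varclus data=patients maxclusters=8 short;
   var ACC001-ACC030;
run;

/* 3. Mixed model on a reduced (here arbitrary) set of disease flags */
proc glimmix data=patients;
   class VISN;
   model PCRVU = age ACC001 ACC005 ACC012 / dist=gamma link=log solution;
   random intercept / subject=VISN;        /* random effect for region */
run;
```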
@1zmm
Thanks for the response. I should clarify two points here. First, for multiple disease types we have only one response value. Equivalently, if we constructed a single column named disease type with 30 classes, a given row would have one value for the response but more than one value for disease type. Dealing with this requires a way to identify the percentage that each disease contributes to the response. The problem arises because the PCRVU (response) was pulled from a different data warehouse than the disease types. Second, once the first issue is addressed, we end up with a set of 30 PCRVUs for each row, each describing the portion of the initial PCRVU accounted for by a specific disease. So we may see this as a multivariate problem with 30 responses (possibly correlated with each other), but now without the disease columns.
This viewpoint is why I asked for a multivariate GLIMMIX rather than its univariate version.
The current version of PROC GLIMMIX (9.3) allows only univariate responses (for example, PCRVU). As far as I know, only PROC GLM allows multivariate responses.
If what you want is somewhat equivalent to the R-squared statistic in multiple linear regression to determine "the percentage of the variability in the dependent variable, PCRVU, 'explained' by the independent variables in the regression model", then you might Google the phrase "variable importance" to look at statistical studies that have tried to tease out the contribution of model variables in "explaining" the variability of the dependent variable. Quasi-R-squared statistics and variable importance statistics are also available for logistic regression and proportional hazards regression.
A crude, not fully acceptable, method for doing this for your application might be the following:
1. Run a regression model with all the disease variables and other potentially confounding variables to determine the maximum R-squared statistic "explaining" the variability in the dependent variable for this model.
2. Run a regression model with only the other potentially confounding variables but none of the disease variables to determine the minimum R-squared statistic "explaining" the variability in the dependent variable by these other variables for this model.
3. Run 30 other regression models with the potentially confounding variables and all but one of each of the 30 disease variables to determine the R-squared statistics "explaining" the variability in the dependent variable by the potentially confounding variables and 29 of the 30 disease variables.
4. Calculate the difference between the maximum R-squared statistic from model #1 and each of the 30 R-squared statistics from the 30 model #3's.
5. Calculate the difference between the maximum R-squared statistic from model #1 and the minimum R-squared statistic in model #2.
6. Use either the calculated difference from #4 above or that difference divided by the difference from #5 above as your measure of the "contribution" of a specific disease type to the variation in your dependent variable.
A problem with this method is that the sum of these contributions for all 30 disease types may not equal and may even exceed the difference from #5 above.
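One possible shortcut for steps 1, 3, and 4, sketched here with ordinary least squares in PROC REG: the drop in R-squared when a single predictor is removed from the full model equals that predictor's Type II squared semipartial correlation, which the SCORR2 option prints. The data set name patients and the confounder age are placeholders, and this sketch ignores both the VISN clustering and any non-normality of PCRVU.

```sas
proc reg data=patients;
   /* Model #1: confounders plus all 30 disease flags.
      SCORR2 prints, for each predictor, R-square(full) minus
      R-square(model without that predictor), i.e. the step 4 differences. */
   full:    model PCRVU = age ACC001-ACC030 / scorr2;
   /* Model #2: confounders only, needed for the difference in step 5 */
   reduced: model PCRVU = age;
run;
quit;
```

The step 5 difference is the R-squared of the model labeled full minus that of the model labeled reduced, and the step 6 ratio can then be formed from the printed values.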
By the way, the suggestion in the previous e-mail was only to reduce the number of disease types or clusters of disease types that you would have to study.
@oloolo Thanks for your reply. It's exactly what I was looking for.