This is very much an applied business/predictive modeling problem. Typically when I have developed predictive models in the past, I have trained on a large sample of data (n ~ 10,000) and performed validation and testing on smaller out-of-sample data sets (n ~ 3,000) that are representative of the size of the population I will ultimately score in a production environment. If model performance holds across these data sets, and if the model continues to perform well in production, I feel that I have a useful model.

Typically my models classify subjects into different risk pools. They are not perfect: sometimes a few people in my 'low risk' group will experience an event, and sometimes a few in my 'high risk' group will never experience one. However, these 'misclassifications', if you want to call them that, are within an acceptable (practical) range of tolerance in the production environment, which again involves scoring 'live' cohorts of n ~ 3,000.

Now I have been asked: what if I build and validate a model using similar proportions of training, validation, and test data as before, but in the production environment I have very small cohorts (n ~ 5-10)? Can I trust model performance estimated on the much larger training, validation, and test data sets when I'm scoring such a small number of subjects in production?

For instance, with the previous business problem, I might put 250 subjects in the high risk group, of whom 75 never actually experience the event based on the test data. We might devote resources to 250 subjects when only 175 really needed intervention. However, the harms from intervention are minimal, and economies of scale put this rate of error within the range of practical tolerance. I'm just not confident that I can score only 7-10 people in production while basing my view of model performance on training, validation, and test data sets in the proportions I have described. With such a small cohort, the cost per subject of intervention is much higher, and an error rate that was within the range of tolerance before may no longer be acceptable.

So, I want to know anyone's thoughts on how to deal with this problem. Should I continue to train and validate on larger data sets, but test on very small data sets? Because of the uncertainty involved, should I validate on a number of small holdout data sets, almost like assessing the generalization error across a bootstrapped collection of, say, 500 holdout samples of size 10 (sketched below)?

I typically work in a SAS Enterprise Miner environment using gradient boosting, but I have also used logistic regression in SAS EG for some of these projects. Any suggestions would be helpful. Maybe I'm missing the (random) forest for the (decision) trees? That was a joke.
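To make that second idea concrete, here is a rough sketch of what I mean by assessing performance across many small holdouts. It's written in Python with simulated data and a placeholder logistic model purely for illustration (in practice I would implement it in SAS Enterprise Miner / EG), so the data, features, and model are assumptions, not my real pipeline:

# Sketch: train on a large sample, then score many small cohorts (n = 10)
# drawn from a held-out pool and look at the spread of cohort-level error.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Placeholder data: X (features) and y (event indicator) stand in for my
# real modeling dataset; they are simulated here just so the sketch runs.
X = rng.normal(size=(13000, 10))
y = (rng.random(13000) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

# Train on the large sample, keep a large pool aside for repeated small holdouts.
X_train, X_pool, y_train, y_pool = train_test_split(
    X, y, train_size=10000, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Draw many small cohorts and record the per-cohort false-positive rate
# among subjects the model flags as high risk.
cohort_size, n_cohorts = 10, 500
error_rates = []
for _ in range(n_cohorts):
    idx = rng.choice(len(X_pool), size=cohort_size, replace=False)
    flagged = model.predict(X_pool[idx]) == 1           # predicted high risk
    if flagged.sum() > 0:
        false_pos = (y_pool[idx][flagged] == 0).mean()  # flagged but no event
        error_rates.append(false_pos)

# Distribution of cohort-level error, not just one aggregate number.
print(np.percentile(error_rates, [5, 50, 95]))

The spread of those cohort-level error rates (e.g., the 5th-95th percentile range) is what I would use to judge whether my error tolerance still holds at n = 10, rather than relying on the single aggregate rate from a large test set.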