Hello All,
I am working on a data set that has 800,000 records, of which only 80 are Events (Target = 1) and all the remaining records are non-events.
I do not want to oversample (that is, take all the event observations and match them with an equal number of non-events), because I would be left with just 160 records.
So I decided to use weighting instead: I weighted up all the events and weighted down all the non-events so that the weighted proportion of events to non-events is 50:50. The weight variable is called good_bad_wgt, and I used it in my logistic regression.
proc logistic data = dummies outest = est;
model Target (event = '1') = %goodvariables / selection = stepwise slstay = 0.05 slentry = 0.05;
weight good_bad_wgt;
run;
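For reference, a minimal sketch of how a weight like good_bad_wgt could be built so that events and non-events each carry half of the total weight. It assumes the data set DUMMIES and the 0/1 variable Target from the post; the output data set DUMMIES_W and the macro variables are hypothetical names, and this is not necessarily how the original weight was computed.

```sas
/* Count events and non-events in DUMMIES (Target is assumed to be 0/1). */
proc sql noprint;
   select sum(Target = 1), sum(Target = 0), count(*)
      into :n_event trimmed, :n_nonevent trimmed, :n_total trimmed
      from dummies;
quit;

/* DUMMIES_W is a hypothetical output data set name. */
data dummies_w;
   set dummies;
   /* Each class receives half of the total weight. */
   if Target = 1 then good_bad_wgt = (&n_total / 2) / &n_event;
   else good_bad_wgt = (&n_total / 2) / &n_nonevent;
run;
```

With the counts in the post (80 events, 799,920 non-events), this gives a weight of 5,000 per event and about 0.50005 per non-event.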
What I want to know is:
1. Are the resulting probabilities overestimated?
2. If so, how do I adjust the probabilities?
If someone can help me better understand how the weight statement in Proc logistic works, I would really appreciate it.
Thanks.
When you use a WEIGHT statement, the weights are used to determine the parameter estimates (="fit the model"). After the parameters are estimated, the procedure does not need or use any additional weights to score data (="evaluate the model based on the values of the explanatory variables"). The predicted probabilities, CIs, etc, are determined solely by the parameter estimates.
Does not the documentation for the WEIGHT statement explain all of this?
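A minimal sketch of the fit-then-score split described above, assuming the weighted data set from the question. MODEL_STORE, NEW_DATA, SCORED, x1, and x2 are hypothetical names used only for illustration.

```sas
/* Fit: the WEIGHT statement influences the parameter estimates. */
/* x1 and x2 stand in for the real explanatory variables.        */
proc logistic data=dummies outmodel=model_store;
   model Target(event='1') = x1 x2;
   weight good_bad_wgt;
run;

/* Score: no WEIGHT statement is needed. The stored parameter    */
/* estimates and the explanatory variables in NEW_DATA determine */
/* the predicted probabilities written to SCORED.                */
proc logistic inmodel=model_store;
   score data=new_data out=scored;
run;
```

The predicted event probability in SCORED (variable P_1 for event level '1') still reflects the weighted fit, because the weights shaped the parameter estimates.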
In the link you shared, it says: "Weights do not affect the computation of predicted probabilities, their confidence limits, or the predicted response level."
Is this correct?
Does it mean the probabilities obtained from PROC LOGISTIC are the true probabilities and need not be adjusted (i.e., with an offset)?
Thanks.
@praneeth09m248 wrote:
In the link you shared, it says: "Weights do not affect the computation of predicted probabilities, their confidence limits, or the predicted response level."
Is this correct?
If that's what SAS says in the documentation, then I believe it is correct. Why would I believe anything else?
Does it mean the probabilities obtained from PROC LOGISTIC are the true probabilities and need not be adjusted (i.e., with an offset)?
Well, um ...
"True probabilities" and "need not be adjusted (i.e. offset)" are phrases whose meaning I really don't know.
They are estimates of the probabilities, given the model (and some simple assumptions). Estimates are never "true probabilities" the way I would use the phrase, but perhaps you are using "true probabilities" differently than I would use it.
@praneeth09m248 wrote:
When we oversample the data, the probabilities obtained are overestimated. To bring the probabilities back to their original scale, we adjust the intercept. (I refer to the probabilities obtained after the adjustment as true probabilities.)
Wait! Now you are oversampling? I don't see where that comes from, you certainly haven't explained that. Furthermore, the weights you provide in the WEIGHT statement ought to eliminate the oversampling.
Furthermore, it is highly likely that SAS makes any needed adjustments under the hood, so that you, the human user, don't have to take the results and adjust them further. But do I know that for sure? No, because I have never dug into it. Still, that would be a very good thing for a statistical analysis package to do, and SAS is a very highly regarded statistical analysis package.
@praneeth09m248 wrote:
In the link you shared, it says: "Weights do not affect the computation of predicted probabilities, their confidence limits, or the predicted response level."
Is this correct?
No, your interpretation is not correct. The paragraph that you quote from begins "If a SCORE statement is specified, then...." The entire paragraph is talking about the SCORE statement and the statement that you quote refers ONLY to data specified on the SCORE statement, not to the data used to fit the model. During the fitting of the model, the weights determine the parameter estimates and therefore affect the predicted probabilities and CLs. However, the sentence that you quote indicates that after the model is fit, then the procedure scores the model based only on the values of the explanatory variables.
If you have weights shouldn't you be using PROC SURVEYLOGISTIC?
Since you’ve weighted the observations, your odds ratios and estimates may not reflect the actual probabilities. Your weighting approach sounds a bit like setting up prior probabilities.
That may be a more intuitive approach.
https://support.sas.com/resources/papers/proceedings14/SAS400-2014.pdf
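To make the link to prior probabilities concrete, here is a hedged sketch of mapping probabilities from the weighted fit back to the original event rate. It assumes the weights depend only on the outcome (w_event for events, w_nonevent for non-events) and that the model is correctly specified, in which case the weighted odds are inflated by the factor w_event / w_nonevent. SCORED, P_1, SCORED_ADJ, and the macro variables are hypothetical names, and the weight values are the ones implied by 80 events among 800,000 records.

```sas
/* Hypothetical weights that give a 50:50 weighted balance. */
%let w_event    = 5000;      /* weight applied to each event     */
%let w_nonevent = 0.50005;   /* weight applied to each non-event */

/* SCORED is assumed to hold P_1, the predicted event probability */
/* from the weighted fit.                                         */
data scored_adj;
   set scored;
   /* Equivalent to multiplying the odds by w_nonevent / w_event */
   /* and converting back to a probability.                      */
   p_adj = (P_1 * &w_nonevent)
         / (P_1 * &w_nonevent + (1 - P_1) * &w_event);
run;
```

Under these assumptions, a weighted probability of 0.5 maps to about 0.0001, which matches the raw event rate of 80 in 800,000. Only the intercept is affected by this correction; the slope estimates are left as they are.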
Both would give the same parameter estimates but different standard errors, with a weight variable or not.
I remember that @Rick_SAS wrote a blog post about this for PROC REG.
If the probability of the event is too small, you have two choices:
1) Oversample; otherwise your model cannot be trusted.
2) Try another distribution, such as the Poisson distribution or the negative binomial distribution.
@Ksharp is probably referring to the article
"The difference between frequencies and weights in regression analysis"
Another relevant article is
"How to understand weight variables in statistical analyses"
which explains the differences between the analytical weights that PROC LOGISTIC uses and the survey weights that PROC SURVEYLOGISTIC uses.