Re: Generate binary values after fitting logistic model

spramanik · Posted 08-08-2011 02:18 PM

Hi all,

I have a large dataset with 8 million observations. I have fitted a logistic model and obtained the predicted probabilities using PROC LOGISTIC. Now, using the predicted probabilities, I would like to generate the 0-1 values corresponding to each observation. I was trying to do that in PROC IML using the RANDGEN function within a loop. Rick Wicklin's blog (April 4, 2011) suggests a similar solution for independent normal distributions. But it's taking forever. When I do the same thing for a subset (2 million) of the dataset, it works reasonably fast (overnight). Is there a better way to do this? Note that the predicted probabilities are different for each observation.

A similar problem occurs when I try to generate from a normal distribution with a different mean for each observation in PROC IML. Any guidance would be highly appreciated.

Santanu

Rick_SAS · Posted 08-08-2011 02:44 PM

Maybe I'm confused. I think all you have to do to get the groups is to assign group=1 when the predicted value is greater than 0.5 and group=0 when the predicted value is less than 0.5. I don't see why RANDGEN comes into play or why there would be a loop. Here is some code that generates some fake data and calls logistic to output predicted probabilities. The PROC IML code just assigns 1 or 0 depending on the predicted probabilities. You can use the DATA step to do the same thing.

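data a(drop = i prob);
call streaminit(321);                 /* seed for reproducibility */
do i = 1 to 1000;
   x = rand("normal");
   prob = exp(x) / (1 + exp(x));      /* inverse logit of x */
   y = rand("Bernoulli", 1-prob);     /* y=1 with probability 1-prob */
   output;
end;
run;

/* fit the model and save the predicted probabilities in PRED */
proc logistic data=a;
model y(event='1') = x;
output out=out pred=pred;
run;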

proc iml;
use out; read all var {pred y}; close out;
class = (pred>= 0.5);
print (sum(class=y));
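
For the DATA step version mentioned above, a minimal sketch might look like the following. It assumes the OUT data set and the PRED and Y variables created by the PROC LOGISTIC call; the names CLASSIFIED, CLASS, and AGREE are only illustrative.

```sas
/* Sketch: same 0.5 cutoff as the PROC IML code, done in a DATA step. */
/* CLASSIFIED, CLASS and AGREE are illustrative names.                */
data classified;
   set out;
   class = (pred >= 0.5);   /* 1 if predicted probability is at least 0.5 */
   agree = (class = y);     /* 1 if the assigned group matches y          */
run;

/* count how many observations were assigned to their observed group */
proc means data=classified sum;
   var agree;
run;
```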

spramanik · Posted 08-08-2011 02:59 PM

Hi Rick,

Thanks for your reply. What if all the predicted probabilities are greater than 0.5 or less than 0.5? Still there is a chance of an observation getting assigned to a different group, right? I would like to incorporate that uncertainty by generating from Bernoulli.

This is similar to fitting a model using PROC MIXED. I can get the predicted (EBLUP) values from PROC MIXED, but those are unrealistically smooth values. I would like to obtain the predicted values in two steps: first generate a value (say, mu) from a normal distribution with mean=synthetic (Xbeta_hat) and common variance=random effect variance component estimate, then generate from a normal distribution with mean=mu and common variance=residual variance estimate.
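
To make the two-step idea concrete, here is a rough DATA step sketch. Everything named in it is an assumption for illustration: a data set PREDS with one row per observation and a variable XBETA holding Xbeta_hat, and two macro variables holding the variance estimates. The numeric values are placeholders, not estimates from any real fit.

```sas
/* Sketch of the two-step draw. PREDS, XBETA, SIM_MIXED, MU and Y_SIM are */
/* hypothetical names; the two variance values below are placeholders.    */
%let sigma2_u = 2.0;   /* placeholder: random effect variance estimate */
%let sigma2_e = 1.0;   /* placeholder: residual variance estimate      */

data sim_mixed;
   set preds;
   call streaminit(12345);   /* only the first call sets the seed */
   /* step 1: mu ~ Normal(mean = Xbeta_hat, variance = sigma2_u) */
   mu = rand("Normal", xbeta, sqrt(&sigma2_u));
   /* step 2: y_sim ~ Normal(mean = mu, variance = sigma2_e) */
   y_sim = rand("Normal", mu, sqrt(&sigma2_e));
run;
```

RAND("Normal") takes a standard deviation rather than a variance, hence the SQRT calls. This draws a separate mu for every row, as described above; if the random effect is shared within a subject or cluster, mu would have to be drawn once per cluster instead.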

Santanu

sgruber · Posted 09-14-2011 02:19 PM

Hi Santanu,

You can generate a random uniform for each observation (U), then set OUTCOME = 1 if prob > U, 0 otherwise.
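
A minimal DATA step sketch of this, assuming the PROC LOGISTIC output data set is named OUT and the predicted probability is in PRED (as in Rick's example). The names SIM, U, and OUTCOME and the seed are illustrative.

```sas
/* Sketch: one Bernoulli draw per observation, each with its own probability. */
/* Assumes data set OUT with variable PRED; SIM, U, OUTCOME are illustrative. */
data sim;
   set out;
   call streaminit(12345);          /* only the first call sets the seed */
   u = rand("Uniform");             /* U ~ Uniform(0,1) for this row     */
   if missing(pred) then outcome = .;
   else outcome = (pred > u);       /* 1 with probability PRED, else 0   */
run;
```

The DATA step reads the data set once, row by row, so no explicit loop or PROC IML is needed. Replacing the uniform comparison with outcome = rand("Bernoulli", pred) should give draws from the same distribution.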

--Susan
