Solved: Scoring a new dataset in PROC MIXED

spramanik · Posted 08-08-2011 02:39 PM

Hi,

I have fitted a linear mixed model using SAS PROC MIXED. Using the fitted model, is it possible to obtain the predicted values (synthetic and EBLUP both) for a new dataset (the way we do it using the SCORE command in PROC LOGISTIC)? By new dataset I mean new values of the covariates (other than what we have in the dataset that was used for modeling). One way to do it is to derive the design matrix corresponding to the new dataset and then use matrix multiplication of the deisgn matrix and the estimated model parameters from PROC MIXED in PROC IML. For large design matrix, I am having memory allocation issues in PROC IML. Then I tried to do the calculation one row at a time within a loop and the memory issue is resolved. Is there a better way to do it in the DATASTEP?

Thanks,

Santanu

Rick_SAS · Posted 08-11-2011 08:44 AM

Here's how to do it, but I hope you realize that this is not actually predicting anything. What you are doing is simulating Y, given the predicted probabilities. A prediction would use a statement such as PredY = (p>0.5);

Anyway, here's what you asked for.

/* Simulate random values for p. Pretend the PRED data is the output from LOGISTIC */
data pred;
drop j;
do j = 1 to 8e6;
p = rand("Uniform"); output;
end;
run;

/* use predicted values to simulate
SimY ~ Bernoulli(p) */
data Sim;
call streaminit(123);
set pred;
SimY = rand("Bernoulli", p);
run;

View solution in original post

Rick_SAS · Posted 08-08-2011 03:07 PM

An interesting idea, but I think there is an easier way.

For simple models like REG and GLM, you can use the SCORE procedure to score new data. You use the OUTEST= option (reg) or the OUTSTAT= option (GLM and other procs) to create an output data set that holds the parameters estimates that PROC SCORE will use.

For MIXED, the model is more complicated. The way to score a mixed model is to use the STORE statement to output the model, and them use the PLM procedure to score the model on new data. The PLM procedure was introduced in SAS 9.2M3 for exactly this purpose. See http://support.sas.com/resources/papers/proceedings10/258-2010.pdf

spramanik · Posted 08-10-2011 12:06 PM

Hi Rick,

Many thanks for your reply. I asked my office to install SAS 9.2 in my PC inorder to score a new dataset using PROC MIXED. Do you have any better solution in your mind to generate values from Bernoulli distribution with different probabilities? Unlike R, SAS does not accept vector values of probabilities.

R code

> my.p<- c(0.05,0.1,0.5,0.75,0.99)

> rbinom(length(my.p),size=1,prob=my.p)

[1] 0 0 1 0 1

> rbinom(length(my.p),size=1,prob=my.p)

[1] 0 0 0 1 1

> rbinom(length(my.p),size=1,prob=my.p)

[1] 0 0 0 1 1

Only way to do it in SAS is within a loop and for 8 million records it took me 7 days (along with another set of generated values from normal with different means). You probably have solution for virtually everything in SAS.

Santanu

Rick_SAS · Posted 08-10-2011 01:18 PM

No loop over observtions is required. Like R, SAS/IML has the ability to fill many random values with a single call. Go to http://blogs.sas.com/iml and browse some of the articles under the "Sampling and Simulation" topic in the right-hand sidebar. For a particularly easy example, see http://blogs.sas.com/iml/index.php?/archives/9-Efficient-Sampling.html. Also see p. 29 and Chapter 13 of my book, Statistical Programming with SAS/IML Software.(Ch. 2 is available as a free download.)

For your example, you say you want to sample from Bernoulli (output will be 0 or 1), but in R you are sampling from a binomial distribution (with "number of trials"=size=1) to get this. In SAS, you can sample from Bernoulli directly.

Here's how you can do this in SAS/IML. It takes about 0.8 seconds:

proc iml;

call randseed(123);
p = {0.05,0.1,0.5,0.75,0.99};
/* each column is for a separate probability */
x = J(8e6, nrow(p)); /* 8 million obs */

/* Use y to fill with random Bernoulli draws with prob=p */
y = x[,1];
t0 = time(); /* time the computation */
do i = 1 to nrow(p);
call randgen(y, "Bernoulli", p);
x[,i] = y; /* assign into the i_th column of x */
end;
t = time()-t0;
print t; /* print elapsed time = 0.8 seconds */

spramanik · Posted 08-10-2011 03:01 PM

Hi Rick,

Thanks for your reply. Probably I have confused you again by treating p as a vector of length 5. That was just an example to show what R can do but SAS connot. Ideally I would like to generate 8 million values from Bernoulli having 8 million different probabilities. This is how I do it in R:

> t1<- Sys.time()

> x<- runif(n=8e6, min=0, max=1)

> length(x)

[1] 8000000

> summary(x)

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.0000 0.2499 0.4999 0.5000 0.7500 1.0000

> y<- rbinom(8e6, size=1, prob=x)

> length(y)

[1] 8000000

> summary(y)

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.0000 0.0000 1.0000 0.5001 1.0000 1.0000

> t2<- Sys.time()

> t2-t1

Time difference of 5.563 secs

To do it in SAS you have to run a loop (following your suggestion) and it is still running in my PC and would take days, probably, to finish. Your example code took few seconds because you ran the loop 5 times (sorry for the confusion, again) instead of 8 million times.

proc iml;

call randseed(123);

p = J(8e6, 1);

call randgen(p, "Uniform");

/* each column is for a separate probability */

x = J(8e6, 1); /* 8 million obs */

t0 = time(); /* time the computation */

do i = 1 to nrow(p);

call randgen(x, "Bernoulli", p);

end;

t = time()-t0;

print t;

quit;

Rick_SAS · Posted 08-10-2011 03:32 PM

No offense, but I wish you'd stop saying "SAS can't do this" and "SAS takes days to run." The following code runs in less than two seconds:

data b;

drop j;

call streaminit(123);

do j = 1 to 8e6;

p = rand("Uniform");

x = rand("Bernoulli", p);

output;

end;

run;

proc print data=b(obs=10);run;

proc means data=b mean std q1 median q3;

var x; run;

I need to do some other work now. Good luck with your project.

spramanik · Posted 08-10-2011 09:53 PM

I am really sorry if I offend you. For a moment I forgot that I am seeking advice from an experienced SAS user and initiated an unnecessary comparison between R and SAS. If you don't mind, let me frame my question one more time.

I have a dataset that has 8 million records. I have fitted a logistic regression model and stored the predicted probabilities in an output SAS dataset. This output dataset has a unique ID variable, 0-1 dependent variable, some covariates, and the predicted probabilities. Now I want to generate binary values (predicted) corresponding to each record, based on their predicted probabilities. I tried to do that using the RANDGEN function in PROC IML, using a loop, and it takes a long time. Is there any efficient way to do it in DATASTEP or in PROC IML?

Thanks,

Santanu

Rick_SAS · Posted 08-11-2011 08:44 AM

Here's how to do it, but I hope you realize that this is not actually predicting anything. What you are doing is simulating Y, given the predicted probabilities. A prediction would use a statement such as PredY = (p>0.5);

Anyway, here's what you asked for.

/* Simulate random values for p. Pretend the PRED data is the output from LOGISTIC */
data pred;
drop j;
do j = 1 to 8e6;
p = rand("Uniform"); output;
end;
run;

/* use predicted values to simulate
SimY ~ Bernoulli(p) */
data Sim;
call streaminit(123);
set pred;
SimY = rand("Bernoulli", p);
run;

spramanik · Posted 08-12-2011 03:07 PM

Hi Rick,

Thank you so much for your reply. You are right about the difference between prediction and simulation. I really want to simulate values based on the predicted (estimated) parameters. I did some checking on the simulated 0-1 values and it seems to be working fine.

Simlar thing I want to do for normal districution also. That is to say, I would like to simulate from normal distribution (in two levels, as I am considering a mixed effects model) with different mean parameters but consatnat sd. If you please have a look at the folowing code (part 2) to assure that the code does exactly what I want to do, that would be very helpful. The code appears to be long but it should be relatively easier for you to go through it.

/*Part1: Fit the logistic model to simulate synthetic 0-1 values*/

proc logistic data=nonfed noprint;

class size naics2;

model E_pos(event='1') = size naics2;

output out=E_binary_pred p=phat_E;

run;

data Sim1;

set E_binary_pred;

E_pred_pos = rand("Bernoulli", phat_E);

run;

/*Part2: Fit the Linear Mixed Model for employment, given that it's positive*/

data E_positive;

set nonfed;

if E_pos = 0 then lAvgE=.;

run;

proc mixed data=E_positive method=reml;

class size own naics2 state;

id CvID EvID county naics;

/*log employment(average of M1, M2, M3) has been modeled */

model lAvgE = size own naics2 MSAid / s OUTPM=E_LMM_XBetaHat;/*store the synthetic predicted values Xbeta_hat*/

random state / s subject=state type=VC;

ods output CovParms = covparms_mixed_E;/*covparms_mixed_E contains variance component estimates*/

run;

proc iml;

use covparms_mixed_E; read all var {Estimate} into sigma2Est; close covparms_mixed_E;

sigmav_hat = sqrt(sigma2Est[1]);

sigmae_hat = sqrt(sigma2Est[2]);

use E_LMM_XBetaHat; read all var {Pred} into mu; close E_LMM_XBetaHat;

n = nrow(mu);

sigmae = repeat(sigmae_hat, n);

sigmav = repeat(sigmav_hat, n);

use nonfed;

read all var {AvgE AvgWage wages MSAid};

read all var {CvID EvID own size naics state county naics2};

close nonfed;

create E_LMM_muHat var {CvID EvID own size naics state county naics2 AvgE AvgWage wages mu sigmav sigmae};

append;

quit;

data Sim2;

set E_LMM_muHat;

x2 = rand("Normal", mu, sigmav);

x3 = rand("Normal", x2, sigmae);

E_pred = exp(x3);

run;

data synth_E_selstate_datastep1;

merge Sim1 Sim2;

by EvID;

run;

data synth_E_selstate_datastep;

set synth_E_selstate_datastep1;

E_synth = E_pred_pos*E_pred;

E_synth = round(E_synth,1);

run;

Thanks again,

Santanu

Rick_SAS · Posted 08-10-2011 01:28 PM

By the way, you can also sovle this problem in the DATA step. It takes about 2.5 seconds to generate the random values and to write all 8 million obs to a data set:

data b;

drop p1-p5 i j;

call streaminit(123);

array p [5] (0.05 0.1 0.5 0.75 0.99);

array x [5] x1-x5;

do i = 1 to 8e6;

do j = 1 to 5;

x{j} = rand("Bernoulli", p{j});

end;

output;

end;

run;

proc print data=b(obs=10);run;

proc means data=b; run;

Registration is open