BookmarkSubscribeRSS Feed
PanShuyao
Fluorite | Level 6

How to get posterior parameters used in each imputation with monotone regression method?

 

I wonder how to get the parameters used in each imputation when we apply monotone regression method in multiple imputation procedure. Let us use the example dataset as follows,

data Fish1;
   title 'Fish Measurement Data';
   input Length1 Length2 Length3 @@;
   datalines;
23.2 25.4 30.0    24.0 26.3 31.2    23.9 26.5 31.1
26.3 29.0 33.5    26.5 29.0   .     26.8 29.7 34.7
26.8   .    .     27.6 30.0 35.0    27.6 30.0 35.1
28.5 30.7 36.2    28.4 31.0 36.2    28.7   .    .
29.1 31.5   .     29.5 32.0 37.3    29.4 32.0 37.2
29.4 32.0 37.2    30.4 33.0 38.3    30.4 33.0 38.5
30.9 33.5 38.6    31.0 33.5 38.7    31.3 34.0 39.5
31.4 34.0 39.2    31.5 34.5   .     31.8 35.0 40.6
31.9 35.0 40.5    31.8 35.0 40.9    32.0 35.0 40.6
32.7 36.0 41.5    32.8 36.0 41.6    33.5 37.0 42.6
35.0 38.5 44.1    35.0 38.5 44.0    36.2 39.5 45.3
37.4 41.0 45.9    38.0 41.0 46.5
;
run;

Then we apply monotone regression method provided by `proc mi`.

proc mi data=Fish1 nimpute=5 seed=769097 out=outex3;
monotone reg(Length2= Length1 / details);
var Length1 Length2;
ods output MonoReg=MonoReg;
run;

We can see the result of estimated parameters used in first imputation are -0.044703 (Intercept) and 0.982951 (Length1).

 

According to the supporting document, regression method is based on Rubin's book (1987, Multiple Imputation for Nonresponse in Surveys, pp. 166–167). The specific steps are demonstrated here.

 

I followed the steps with code shown as below, but I can't get the same result of the parameters used in the imputation. I just repeat all the steps one time, so only one result was gotten. Presented by beta_star, my results were -0.04867 (Intercept) and 0.9804691 (Length1).

 

What's wrong with my code? 

/* Extract records with missing Length2 */
data Length2_missing;
    set fish2;
    if missing(Length2);   
    keep Length1;          
run;

/* Extract observed records (non-missing Length2) for regression modeling */
data Length2_observed;
    set fish2;
    where not missing(Length2);   
    keep Length1 Length2;         
run;

/* Fit a regression model: Length2 ~ Length1 using GENMOD */
proc genmod data=Length2_observed;
model Length2 = Length1 / dist=normal link=identity; /* Normal distribution with identity link */
ods output
ParameterEstimates=beta_est(where=(Parameter='Intercept' | Parameter='Length1'));
output out=residuals
pred=pred_y /* Predicted values */
resraw=raw_res /* Raw residuals */
reschi=pearson_res /* Pearson residuals */
resdev=deviance_res; /* Deviance residuals */
run; /* Calculate residual variance (s^2) and degrees of freedom (df) */ proc sql noprint; select count(*) into :n from Length2_observed; /* Total observed sample size (n) */ select &n - 2 into :df from Length2_observed; /* Degrees of freedom: df = n - p (p=2 parameters: intercept + Length1) */ quit; /* Bayesian posterior sampling for β using IML (Matrix Language) */ proc iml; /* Read regression coefficients (β_hat) */ use beta_est; read all var {'Estimate'} into beta_hat; /* Extract β estimates */ close beta_est; /* Read residuals to compute Sum of Squared Errors (SSE) */ use residuals; read all var {'raw_res'} into RAW; /* Raw residuals */ RESIDUAL = RAW##2; /* Squared residuals */ SSE = (1/(&n - 2)) * sum(residual[, 1]); /* Residual variance estimator: s^2 = SSE / df */ close residuals; /* Construct design matrix X (with intercept) */ use Length2_observed; read all var {'Length1'} into X; /* Predictor variable */ close Length2_observed; X = j(nrow(X), 1, 1) || X; /* Add intercept column (1s) */ /* Cholesky decomposition of covariance matrix for sampling */ U = root(inv(X` * X)); /* Cholesky root of (X'X)^(-1) */ /* Set random seed for reproducibility */ call randseed(769097); /* Initialize random number generator */ call streaminit(769097); /* Synchronize random stream */ /* Generate random normal vector Z ~ N(0, I) */ Z = randnormal(1, {0, 0}, {1 0, 0 1}); /* Z: 1x2 vector from standard normal */ /* Sample posterior sigma2 from inverse chi-square distribution */ sigma2 = SSE * &df / rand('chisq', &df); /* sigma2 ~ Inv-Chisq(df, SSE) */ /* Sample β_star from posterior: β_star = β_hat + σ * U * Z */ beta_star = beta_hat + sqrt(sigma2) * U * Z`; /* Posterior draw of coefficients */ print beta_star; /* Display sampled β_star */ /* Save sampled β_star to a dataset */ create beta_sample from beta_star[colname={'beta_star'}]; append from beta_star; close beta_sample; quit;
5 REPLIES 5
FreelanceReinh
Jade | Level 19

Hello @PanShuyao,

 

I haven't checked more details, but my first guess is that the difference you observed is due to the fact that PROC MI and PROC IML use different random number generators (RNG): The documentation of the SYSRANEND Macro Variable mentions that PROC MI uses the "older RNG, which is used by the RANUNI function", whereas both CALL RANDSEED and CALL STREAMINIT use the Hybrid 1998/2002 32-bit Mersenne twister as the default (and they don't even provide an option to use the old RNG). So, using the same random seed (769097) or not, the random number streams will be totally different.

 

Addendum:

Of course, make sure that you are using the same input dataset. Currently, you have FISH1 and FISH2.

 

For a better comparison of the two approaches, you could run both on multiple copies of the input dataset (ideally avoiding loops) and then see if the mean values of the statistics of interest converge. On the part of PROC MI the analysis of 1000 copies of dataset FISH1 could look like this:

proc surveyselect data=fish1 out=morefish reps=1000 rate=1;
run;

ods select none;
ods noresults;

proc mi data=morefish nimpute=5 seed=769097 out=outex3_m;
by replicate;
monotone reg(Length2= Length1 / details);
var Length1 Length2;
ods output MonoReg=MonoReg_m;
run;

ods select all;
ods results;

proc means data=monoreg_m mean std clm;
class effect;
var _1;
run;
PanShuyao
Fluorite | Level 6

Hi~@FreelanceReinh, thank you for your answer. Now I use RANNOR and RANGAM in PROC IML to generate normal distributed data and chi-square distributed data. But the results are still different from PROC MI. I think RANNOR and RANGAM use the older RNG.

FreelanceReinh
Jade | Level 19

@PanShuyao wrote:

I think RANNOR and RANGAM use the older RNG.


This is true, but I'm not surprised that the results still don't match. The (pseudo-)random numbers depend on the order of operations and at least I don't know how PROC MI works internally, so it is not obvious how to replicate its computations manually.

 

Please see the addendum I added to my first reply for a more promising approach.

SAS_Rob
SAS Employee

It looks like to me that you have the steps correctly specified.  The only issue is that you did not standardize the data prior to running the GENMOD step.  This would effect the betas as well as the MSE.  I would recommend using REG instead of GENMOD as well so you can easily pick off the Sigma0_2 using the RMSE.

PanShuyao
Fluorite | Level 6

Thank you for your answer. I forgot to include my standardization code here. Actually, the dataset currently used, named fish2, is the result after standardization. The standardization code:

/*Standardize all variables in the dataset named fish1*/
proc stdize data=fish1 out=fish2;
	var Length1 Length2 Length3;
run;

 

sas-innovate-white.png

Missed SAS Innovate in Orlando?

Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.

 

Register now

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 482 views
  • 0 likes
  • 3 in conversation