kduggan
Calcite | Level 5

Hello,

 

I am trying to singly impute missing data using stochastic regression with proc mi in SAS/STAT 14.1.  Here is a sample of my code.  (Only var1 has any missingness, by sheer luck.)

 

proc mi data = INPUTFILE out = OUTPUTFILE
        minimum = 0.00 maximum = 1.00   /* bounds for imputed values */
        nimpute = 1 seed = 123456;      /* one imputation, fixed seed */
   var VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8 RACE RISK VAR11 VAR12 VAR13
       VAR14 VAR15 VAR16;
   class RACE RISK;
   /* FCS regression imputation for VAR1, with one burn-in iteration */
   fcs nbiter = 1 reg (VAR1 = VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8 RACE RISK VAR11 VAR12 VAR13
                       VAR14 VAR15 VAR16);
run;

 

I am using a seed so that the procedure generates the same output data file every time.  

 

I noticed that when I sort the file by race before the proc mi step (just proc sort data = inputfile; by race; run;), without asking SAS to impute by race, I get different imputed values in the output file than when the file is sorted by ID.  From what I understand, the sort order of the input data set should not matter for proc mi.

 

At first I thought maybe SAS was imputing the file separately by race, but when I added a "by race;" statement to proc mi, the estimates were different again.  So, I don't think that is what is happening.

 

To summarize my estimates:

1. one set of estimates when the file is sorted by ID before proc mi.

2. one set of estimates when the file is sorted by race before proc mi.

3. a third set of estimates when I add a "by race" statement to proc mi.
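
On the third case in the list above: a BY statement makes a SAS procedure run separately within each BY group, so the imputation regression is fit once per race group instead of once on the whole file. The following is a minimal Python sketch of that idea only, not PROC MI itself. The data, the group labels, and the helper `fit_line` are all made up for illustration, and the pooled model here omits race as a covariate to keep the arithmetic simple.

```python
def fit_line(points):
    """Ordinary least squares for y = a + b*x on a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    return my - b * mx, b  # (intercept, slope)

# Hypothetical complete cases: (group, x, y).
cases = [("A", 0.0, 0.1), ("A", 1.0, 0.3), ("A", 2.0, 0.5),
         ("B", 0.0, 0.9), ("B", 1.0, 0.8), ("B", 2.0, 0.7)]

# One model on the whole file versus one model per group.
pooled = fit_line([(x, y) for _, x, y in cases])
group_a = fit_line([(x, y) for g, x, y in cases if g == "A"])
group_b = fit_line([(x, y) for g, x, y in cases if g == "B"])

# Predicted value for a record with x = 2.0 whose y is missing.
x_missing = 2.0
pred_pooled = pooled[0] + pooled[1] * x_missing
pred_a = group_a[0] + group_a[1] * x_missing
pred_b = group_b[0] + group_b[1] * x_missing

print(pred_pooled, pred_a, pred_b)
```

Because the fitted coefficients differ between the pooled and per-group models, the predictions differ before any random residual is added, which is consistent with a third, distinct set of imputed values.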

 

How the file is sorted should not matter if I don't have a "by" statement in proc mi, right?  For example, in the SAS documentation for proc mi (https://support.sas.com/documentation/onlinedoc/stat/141/mi.pdf), on page 5890, it says "note that the input data set does not need to be sorted in any order."

 

Can anyone help me understand why I'm getting different estimates, if this is consequential, and how consequential it might be?  Would the estimates somehow be less valid if the file were sorted by race instead of ID, despite not having a "by" statement within the proc mi command?  

 

Thank you for your time.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User

Doesn't the word STOCHASTIC imply randomness?  If you give it the same seed, it randomly selects the same observation numbers, but if you have sorted the observations in a different order, those observation numbers point to different records, so it will use different values in the calculations.
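
A minimal Python sketch of this point, under stated assumptions: it is not PROC MI's actual algorithm. The function `impute`, the fixed prediction of 0.5, and the residual SD of 0.1 are invented for illustration. It only shows that a seeded generator produces the same sequence of draws, and that those draws are handed out in file order.

```python
import random

def impute(rows, seed):
    """Fill missing y with a fixed prediction plus a seeded random residual.

    rows: list of (id, y) pairs in file order; y is None when missing.
    The prediction (0.5) and residual SD (0.1) are made-up constants.
    """
    rng = random.Random(seed)        # same seed -> same sequence of draws
    out = {}
    for row_id, y in rows:           # draws are consumed in file order
        if y is None:
            y = 0.5 + rng.gauss(0.0, 0.1)
        out[row_id] = y
    return out

# The same four records in two different sort orders.
rows_by_id = [(1, None), (2, 0.4), (3, None), (4, 0.7)]
rows_resorted = [(3, None), (4, 0.7), (1, None), (2, 0.4)]

a = impute(rows_by_id, seed=123456)
b = impute(rows_resorted, seed=123456)

print(a[1] == b[3])  # whichever record comes first gets the first draw
print(a[1] == b[1])  # the same id gets a different draw after re-sorting
```

Rerunning with the same seed and the same order reproduces the same values, so sorting by a unique key such as ID before imputing is one way to make the output repeatable.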


5 REPLIES
art297
Opal | Level 21

I am NOT an expert regarding proc mi, but I would think that your use of fcs nbiter = 1 is what is causing your differences. I'm not sure what the documentation means by "number of burn in iterations", but I have to think it is limiting the number of records used for the regression. If that is the case, then different sort orders would definitely be expected to produce different results.

 

HTH,

Art, CEO, AnalystFinder.com

 

kduggan
Calcite | Level 5

Hi Art,

 

Thank you so much for your feedback.  Clearly, I am not an expert on MI either.  I have traditionally used other approaches (e.g., maximum likelihood) with missing data, so MI is new to me too.  

 

That's a good point about nbiter.  I thought it was running one single imputation, and I agree that the "number of burn in iterations" wording is a little unclear.  I don't think the nbiter option is causing the different estimates, though.  I removed that option and re-ran proc mi, first sorting by ID and then again by race, and once again I got different estimates from the two runs.  But still, it was worth a shot, and at least something has been narrowed down!

Dongfang
Calcite | Level 5
I have the same issue: different data orders before imputation generate different imputed values. Could you please keep us posted about this?

kduggan
Calcite | Level 5

Hi Tom,


Thanks so much for your reply!  That's what I had figured it was doing (somehow imputing differently depending on the order of observations), but it is helpful to know that others think so too.  Indeed, the difference seems inconsequential in the long run: neither set of imputed values should be more or less valid than the other for future analyses.  In support of that, the means and SDs of scores among imputed participants only are very close (within .01 of each other), even though scores aren't necessarily comparable for any one participant.
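
The observation about means and SDs can be checked with a small simulation. This is a hedged Python sketch, not a reproduction of the analysis above. The 2000 simulated records, the intercept 0.2, slope 0.6, residual SD 0.1, and the helper `impute_all` are all assumptions for illustration. Every record is treated as needing imputation, and the same seeded residual draws are assigned in file order under two different sorts.

```python
import random
import statistics

def impute_all(rows, seed):
    """Return {id: imputed value}, drawing residuals in file order.

    Intercept 0.2, slope 0.6 and residual SD 0.1 are invented constants.
    """
    rng = random.Random(seed)
    return {row_id: 0.2 + 0.6 * x + rng.gauss(0.0, 0.1) for row_id, _, x in rows}

# Simulated records: (id, group, x), with three hypothetical groups.
data_rng = random.Random(1)
rows = [(i, i % 3, data_rng.random()) for i in range(2000)]

# Same seed, two sort orders: by id and by group.
by_id = impute_all(sorted(rows, key=lambda r: r[0]), seed=123456)
by_group = impute_all(sorted(rows, key=lambda r: r[1]), seed=123456)

mean_gap = abs(statistics.mean(by_id.values()) - statistics.mean(by_group.values()))
sd_gap = abs(statistics.stdev(by_id.values()) - statistics.stdev(by_group.values()))
changed = sum(by_id[i] != by_group[i] for i in by_id)

print(f"mean gap {mean_gap:.6f}, SD gap {sd_gap:.6f}, rows changed {changed}")
```

In this sketch, nearly every record receives a different imputed value after re-sorting, while the summary statistics of the imputed values barely move, since the same set of residual draws is only reassigned among records.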

 

Thank you, again, for your help!

