Hi all,
I am working with forest inventory data where some plots were not visited in the field. I would like to impute the missing values of forest cover within those plots. I have tried using PROC MI (SAS ver. 9.4), but keep getting the message: "ERROR: Fewer than two analysis variables". In the code below, NFI2KMCL and ssu together form an identifier of the plot, "forest" indicates whether the plot has been identified as forest (coded 0 or 1 in the sample data below), and "A_forest" is the measured forest area, which is sometimes missing and needs to be imputed (values can only range from 0 to 0.0706 hectares, as the circular plots have a radius of 15 m).
Hope that someone can help me out!
Thomas
data NFI;
input nfi2kmcl :$20. ssu $ forest A_forest;
cards;
2km_6396_588_EUREF89 C 0 0
2km_6396_588_EUREF89 E 0 0
2km_6070_660_EUREF89 E 1 0.070685835
2km_6070_660_EUREF89 G 1 0.018552237
2km_6070_662_EUREF89 A 1 .
2km_6070_662_EUREF89 G 1 .
2km_6070_666_EUREF89 A 1 .
2km_6070_666_EUREF89 G 1 .
2km_6070_672_EUREF89 C 1 0.070685835
2km_6070_672_EUREF89 E 1 0.070685835
2km_6070_672_EUREF89 G 1 0.070685835
2km_6070_688_EUREF89 A 1 0.070685835
2km_6070_688_EUREF89 E 1 0.070685835
2km_6070_688_EUREF89 G 1 0.070685835
2km_6080_524_EUREF89 C 1 0
2km_6080_524_EUREF89 E 1 0.070685835
2km_6080_526_EUREF89 A 1 .
2km_6080_526_EUREF89 G 1 .
2km_6080_528_EUREF89 A 1 .
2km_6080_528_EUREF89 C 1 .
;
proc mi data=NFI seed=501213 nimpute=6 min=0 max=0.070686 out=NFI_out;
mcmc;
var A_forest;
by forest;
run;
What if you let it work over two variables:
data NFI;
input nfi2kmcl :$20. ssu $ forest A_forest;
cards;
2km_6396_588_EUREF89 C 0 0
2km_6396_588_EUREF89 E 0 0
2km_6070_660_EUREF89 E 1 0.070685835
2km_6070_660_EUREF89 G 1 0.018552237
2km_6070_662_EUREF89 A 1 .
2km_6070_662_EUREF89 G 1 .
2km_6070_666_EUREF89 A 1 .
2km_6070_666_EUREF89 G 1 .
2km_6070_672_EUREF89 C 1 0.070685835
2km_6070_672_EUREF89 E 1 0.070685835
2km_6070_672_EUREF89 G 1 0.070685835
2km_6070_688_EUREF89 A 1 0.070685835
2km_6070_688_EUREF89 E 1 0.070685835
2km_6070_688_EUREF89 G 1 0.070685835
2km_6080_524_EUREF89 C 1 0
2km_6080_524_EUREF89 E 1 0.070685835
2km_6080_526_EUREF89 A 1 .
2km_6080_526_EUREF89 G 1 .
2km_6080_528_EUREF89 A 1 .
2km_6080_528_EUREF89 C 1 .
;
proc mi data=NFI seed=501213 nimpute=6 min=0 max=0.070686 out=NFI_out;
mcmc;
var forest A_forest;
run;
Well, then it works of course, but the intention was to impute only the variable of interest. If I use some other variable just to make it run, that variable will affect the result of the imputation ... at least as far as I understand it.
Thomas
It's my guess that MI uses the second variable as some kind of "help". But I'm no statistician, maybe @Rick_SAS can provide more insight.
MI is meant to impute based on a multivariate distribution and thus needs more than 1 variable.
Are there any other SAS procedures made for single-variable imputation that you could recommend using instead?
Thanks for the reply
Thomas
In the case of one variable, MI is similar to bootstrap resampling. For each imputed sample, you can replace each missing value with a random value drawn from the nonmissing (observed) values. For example, when forest=1, your data has
1 value of 0
1 value of 0.018552237
8 values of 0.070685835
It's not clear to me what you want to do with the forest=0 data, which doesn't have missing values. Copy it over to each imputed set?
Anyway, for the forest=1 data, you can write a program such as the following to replace missing values with a random observed value:
/* initial distribution of values */
proc freq data=NFI;
where forest=1;
tables A_forest / missprint;
run;
/* multiple imputations of the forest=1 data */
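data Impute;
   call streaminit(54321);
   /* observed values and their relative frequencies when forest=1 */
   array Value[3] _temporary_ (0.070685835, 0.018552237, 0);
   array Prob[3] _temporary_ (0.8, 0.1, 0.1);
   set NFI(where=(Forest=1));
   ObsNum = _N_;
   Orig = A_forest;   /* keep the original value so every imputation gets a fresh draw */
   do _Imputation_ = 1 to 5;
      if Orig = . then do;
         i = rand("Table", of Prob[*]);
         A_forest = Value[i];
      end;
      output;
   end;
   drop Orig i;
run;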
proc sort data=Impute;
by _Imputation_ ObsNum;
run;
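If the answer to the forest=0 question above is "yes, copy it over to each imputed set", one possible way to do it is sketched below. This is only an assumption about what is wanted: the complete forest=0 rows are replicated once per imputation and stacked with the imputed forest=1 rows. The data set names Forest0 and AllImputed are made up for illustration, and the loop bound 5 has to match the number of imputations used for Impute.

```sas
/* Sketch: replicate the complete forest=0 rows once per imputation. */
/* Forest0 and AllImputed are hypothetical names.                    */
data Forest0;
   set NFI(where=(forest=0));
   do _Imputation_ = 1 to 5;   /* must match the loop in the Impute step */
      output;
   end;
run;

/* stack the replicated forest=0 rows with the imputed forest=1 rows */
data AllImputed;
   set Forest0 Impute;
run;

/* order by imputation so later steps can use BY _Imputation_ */
proc sort data=AllImputed;
   by _Imputation_ nfi2kmcl ssu;
run;
```

Each value of _Imputation_ in AllImputed then holds one complete copy of all 20 plots, which is the layout that BY _Imputation_ processing expects.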
/* final distribution of values across all imputed sets */
proc freq data=Impute;
tables A_forest / missprint;
run;
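The Value and Prob arrays above are typed in by hand from the PROC FREQ output. For real data with many distinct values, a variation is to draw each imputed value directly from the observed forest=1 rows, which gives the same empirical distribution without hard-coding it. The following is a minimal sketch of that idea, not a tested production program; the names Observed, Donor, Impute2, k and nDonors are made up for illustration.

```sas
/* Sketch: the nonmissing forest=1 values serve as the donor pool. */
data Observed(keep=Donor);
   set NFI;
   where forest=1 and A_forest ne .;
   Donor = A_forest;
run;

data Impute2;
   call streaminit(54321);
   set NFI(where=(forest=1));
   ObsNum = _N_;
   Orig = A_forest;                          /* original value, possibly missing */
   do _Imputation_ = 1 to 5;
      if Orig = . then do;
         k = ceil(nDonors * rand("Uniform")); /* random donor row, 1..nDonors */
         set Observed point=k nobs=nDonors;   /* direct access to that row */
         A_forest = Donor;
      end;
      output;
   end;
   drop Orig Donor;
run;
```

Because every donor row is equally likely, a value that occurs 8 times among 10 observed rows is drawn with probability 0.8, which matches the Prob array in the hand-coded version.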