Sampark
Calcite | Level 5

Hello all, I am working with a data set that was collected through in-person surveys. There are a lot of missing data points (for some variables, a little more than 50%), so I have decided to use multiple imputation to deal with the missing data. Before I explain my problem, I'll explain what my variables are. I have variables labeled q_1 through q_13, which correspond to 13 questions that respondents answered, with their answers coded as numbers. For example, if someone chose option 2 for question 1, then the corresponding cell would read "2" under the q_1 column. I also have the categorical variables "Religion" (coded as 1 for "Catholic" and 5 for "No religious affiliation"), "Age", where each number corresponds to an age range (1: 18-30 years old, 2: 30-35 years old, etc.), "Gender", where 1=Male and 2=Female, and "Location", where the location is listed explicitly as a string, e.g. Wyoming or Utah.
If my coding for these variables is not good, then perhaps I should change it first? Religion has the values 1 and 5 rather than 0 and 1 because the original survey had 5 options, but a large percentage of respondents were either Catholic (option 1) or had no religious affiliation (option 5), so we recoded every other reply as a missing value. In any case, I do not think that this necessarily has an effect on my PROC MI step, but I could certainly be wrong.
 
I used this UCLA seminar on multiple imputation as a reference for PROC MI: https://stats.idre.ucla.edu/sas/seminars/multiple-imputation-in-sas/mi_new_1/

Here is the example code given in the seminar:
proc mi data= new nimpute=10 out=mi_mvn seed=54321;
var socst science write read female math progcat1 progcat2;
run;
 
Following the author's syntax, I ran the following:
proc mi data=survey nimpute=10 out=mi_mvn seed=54321;
var q_1 q_2 q_3 q_4 q_5 q_6 q_7 q_8 q_9 q_10 q_11 q_12 q_13
$Religion $Age $Gender $Location;
run;
 
and I got the following message in my log:
NOTE: PROCEDURE MI used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.MI_MVN may be incomplete. When this step was
stopped there were 0 observations and 0 variables.
8 $Religion $Age $Gender $Location;
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: a name, ;, -,
:, _ALL_, _CHARACTER_, _CHAR_, _NUMERIC_.
ERROR 76-322: Syntax error, statement will be ignored.
9 run;
 
which is strange, because when I use PROC PRINT to print the generated dataset "mi_mvn", it appears that I get the imputed dataset. Furthermore, when I simply run
proc mi data=survey nimpute=10 out=mi_mvn seed=54321;
run;
 
I get the following tables: Model Information, Missing Data Patterns, EM (Posterior Mode) Estimates, Variance Information, and Parameter Estimates
 
So something must be wrong with my VAR statement, but what? I did import the dataset before running PROC MI and it prints fine. What do you guys think? Every resource I've seen follows the syntax given above, so what am I doing wrong?
 

EDIT: To be clear, I am trying to use the Pearson chi-square test to see if there are any statistically significant differences in the responses to each question due to Religion, Location, and Gender. After combing through the internet for a bit, I see that explanations of how to do this are relatively sparse. Is what I am attempting a bit advanced?
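
As a rough illustration of the goal described in the EDIT, the sketch below runs a Pearson chi-square test separately within each imputed data set. The data set name mi_out is hypothetical: it stands for any PROC MI output in which the imputed values are valid categories. PROC MI adds the _Imputation_ variable to its OUT= data set. Pooling the per-imputation chi-square results into one test takes further steps that are not shown here.

```sas
/* Sketch only: one chi-square test per imputed data set.              */
/* mi_out is a hypothetical PROC MI output data set; _Imputation_ is   */
/* the index variable that PROC MI writes to its OUT= data set.        */
proc freq data=mi_out;
   by _Imputation_;
   tables q_1*Religion / chisq;   /* Pearson chi-square for q_1 by Religion */
run;
```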

3 REPLIES 3
Reeza
Super User

Why do you have '$' signs in your VAR statement? Those are not required. Note that the error starts at the $ sign in your code.

 

proc mi data=survey nimpute=10 out=mi_mvn seed=54321;
var q_1 q_2 q_3 q_4 q_5 q_6 q_7 q_8 q_9 q_10 q_11 q_12 q_13 $Religion $Age $Gender $Location;
run;
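
For reference, a minimal sketch of the same step with the '$' prefixes removed, since the VAR statement takes bare variable names. It assumes every listed variable is stored as numeric in the survey data set. Location is left out here because it is described as a string, and a character variable cannot be used in VAR without also being listed in a CLASS statement.

```sas
/* Sketch only: the call above without the '$' prefixes.               */
/* Assumes q_1-q_13, Religion, Age and Gender are numeric in SURVEY.   */
/* Location (a character variable) is omitted from this version.       */
proc mi data=survey nimpute=10 out=mi_mvn seed=54321;
   var q_1-q_13 Religion Age Gender;
run;
```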
Sampark
Calcite | Level 5

 

Okay, good to know, thank you. After removing the "$", however, I get this error message:

 

ERROR: Variable q_1 should be either numeric or in CLASS list.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.MI_MVN may be incomplete. When this step was
stopped there were 0 observations and 0 variables.
WARNING: Data set WORK.MI_MVN was not replaced because this step was
stopped.

 

So now it seems that something is wrong with my q_1 variable (and possibly my other "q" variables). Is it because they are numeric when they are supposed to represent categorical variables?
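
The wording of the error suggests that q_1 may actually be stored as a character variable, which can happen when a file is imported. A minimal sketch of how one might check the stored types and, if needed, build numeric copies is below. It assumes all thirteen q_ variables are character; the names survey_num and qn_1-qn_13 are made up for illustration.

```sas
/* Step 1: list each variable's stored type (Num or Char). */
proc contents data=survey;
run;

/* Step 2 (only if q_1-q_13 turn out to be character): make numeric copies. */
/* survey_num and qn_1-qn_13 are hypothetical names.                        */
data survey_num;
   set survey;
   array q_char{13} q_1-q_13;      /* existing variables, assumed character */
   array q_num{13} qn_1-qn_13;     /* new numeric variables                 */
   do i = 1 to 13;
      /* ?? suppresses log errors; unreadable values become missing */
      q_num{i} = input(q_char{i}, ?? best12.);
   end;
   drop i;
run;
```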

Reeza
Super User

@Sampark wrote:

 

 

So now it seems that something is wrong with my q_1 variable (and possibly my other "q" variables). Is it because they are numeric when they are supposed to represent categorical variables?


I *think* it's because PROC MI imputes numeric values by default, not categorical values. Character variables are assumed to be categorical and should be specified in the CLASS statement. I would suggest verifying this by reviewing the documentation.
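
As a rough sketch of what using the CLASS statement could look like, the step below pairs it with an FCS statement, which lets PROC MI impute classification variables with logistic regression or a discriminant function. This is an untested outline under several assumptions: the q_ variables and Age are numeric and are treated as continuous (so their imputed values need not be whole numbers), Religion and Gender each have two levels, and mi_fcs is a made-up output name.

```sas
/* Sketch only, under the assumptions stated above.                     */
proc mi data=survey nimpute=10 out=mi_fcs seed=54321;
   class Religion Gender Location;            /* categorical variables   */
   fcs logistic(Religion)                     /* two-level: logistic     */
       logistic(Gender)
       discrim(Location / classeffects=include); /* nominal: discriminant */
   var Age q_1-q_13 Religion Gender Location; /* others: regression      */
run;
```

With more than half the values missing on some variables, the output tables are worth checking before relying on the imputed values.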

