I have dataset with a many survey responses and many variables. One variable a string variable named Q3 in which the values look like this:
1,2
1,3,12
1,2,3,12,7,8,9
2,3,9
... (yes, the values are not entirely in order, though this is probably not a big deal)
I would like to parse this into a set of binary dummy variables, one for each possible value in Q_3, say Q3_1, Q3_2, Q3_3, ... Q3_12. For example, the values of Q3_9 for the data above would be:
0
0
1
1
I've played with some do loops using the SCAN function, but I have not been able to make it work. Can't find a forum question or help doc that addresses quite this situation.
The trimming is so that trailing spaces stored in the original variable do not cause false negatives. You could also just include space as one of the delimiters.
data want ;
set have;
array q3_ [12] ;
do index=1 to dim(q3_);
q3_[index]=0 ne indexw(q3,cats(index),' ,');
end;
drop index;
run;
The ARRAY statement will create 12 variables named Q3_1 to Q3_12.
The CATS() function will convert the numeric index into a string.
The INDEXW() or FINDW() function will locate that string in the list.
The comparison will check whether or not it was found.
The programming is easier than it seems. However, it is possible that your data never includes some possible answers and it would be up to you to supply a complete list. Are you prepared to do that?
Thanks for your reply. I'm not sure I understand what you mean. The term "value" here is a little ambiguous. It could mean the numbers between the commas, or it could mean any combination of these in a comma-delimited list.
In any case, the values are from from the back-end of a web-based survey instrument (which I did not design, or I would not be in this situation). I know what the possible values are. I have several survey items set up like this, each with a different number of possible responses.
Just to clarify, you would need to know the number of possible survey responses for each question.
Can we assume that the survey question answers are integers that run from 1 to as many as needed?
There will be an array statement for each question, along the lines of:
array q3_ {12};
This statement would create Q3_1 through Q3_12. So it is necessary to know for each question how many variables must be created.
Yes, I know the number and range of numbers that represent the possible responses for each item.
Try the FINDW() function.
data want ;
set have;
array q3_ [12] ;
do index=1 to dim(q3_);
q3_[index]=0 ne findw(q3,cats(index),',','t');
end;
drop index;
run;
Obs q3 q3_1 q3_2 q3_3 q3_4 q3_5 q3_6 q3_7 q3_8 q3_9 q3_10 q3_11 q3_12 1 1,2 1 1 0 0 0 0 0 0 0 0 0 0 2 1,3,12 1 0 1 0 0 0 0 0 0 0 0 1 3 1,2,3,12,7,8,9 1 1 1 0 0 0 1 1 1 0 0 1 4 2,3,9 0 1 1 0 0 0 0 0 1 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0
This looks exactly like it! I will try with my real data tomorrow! Thanks!
This appears to have worked perfectly!
There's a whole lotta trimmin' goin' on in there, but if I read the code right, the main action is findw. Can you tell me if I understand this?
Basically it creates a variable name and assigns a value based on a logical evaluation of the statement:
0 ne [position of index in the string]
...so if the index does not exist in the string, findw returns 0, 0 ne 0 evaluates false, and the value 0--for false--is assigned to the variable.
...but if the index does exist, findw returns a number greater than 0, 0 ne [not 0] evaluates true, and the value 1--for true--is assigned to the variable.
...? Thanks again!
The trimming is so that trailing spaces stored in the original variable do not cause false negatives. You could also just include space as one of the delimiters.
data want ;
set have;
array q3_ [12] ;
do index=1 to dim(q3_);
q3_[index]=0 ne indexw(q3,cats(index),' ,');
end;
drop index;
run;
The ARRAY statement will create 12 variables named Q3_1 to Q3_12.
The CATS() function will convert the numeric index into a string.
The INDEXW() or FINDW() function will locate that string in the list.
The comparison will check whether or not it was found.
Do you have access to the software that collected the data? Several of the survey data collection programs I have used have options for different layouts of export data sets. A common one is for multiple response questions to have the output as indicators such as
Q3_1 = 1 if the first response is chosen or Q3_2=0 if the second response choice is NOT chosen.
My first choice would be to go the collection software and see about such options.
A second question for you in your "dummy variables" is do you need to capture the ORDER that responses were chosen? That may also be available as an option from the collection software.
By all the responses, as @Astounding requested, we would need to have a list for every single variable what the exact choices might be. Your example for Q3 shows values: 1, 2, 3, 7, 8, 9, 12. Do values 4,5,6,10,11 possibly exist? How about 13,14, 15?? and possibly special values like 99 for "refused to answer"? Every single possible code that could appear for Q3 is needed to code properly. And for every single other question that has such a list.
I have worked with survey data for over 25 years and from the example you show I would never assume that because you have 3 and 7 in the data that codes of 4,5, and 6 a valid responses. I have seen questions get reused over years with changing codes due to the responses of previous years and seen multiple response lists with as many as 60 such codes. So I would want every single value that is expected explicitly stated before recoding values. But also, as mentioned above, I would start with the software export options to save a lot of potential headaches.
Thanks for your thoughtful reply.
As I said above, I know all the possible response values for every item. If an unnecessary variable were to be generated, I can easily delete or ignore it.
You're right that I could have played with the survey software (it's Qualtrics). I didn't think of that. I generally take any opportunity to learn new SAS skills. In any case, this instrument (which is reused periodically) is on the block for revisions soon, and a cleanup of these ridiculous multiple-response items is definitely part of that!
Tom's solution using findw works perfectly.
Also, I see that you are right that it would obviously be better to feed a discrete and correct list of the values to a script. In this case, because we are going to revise the instrument, I hope I never have to use this code again! Also, there is a check on spurious variables, in that the codebook--which I have to make myself--will not contain those values, and I will see this when I apply labels and such. Thanks again for your thoughts!!
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.