Hello,
I'm in the process of cleaning a dataset (from a survey) using SAS 9.4 and I'm trying to figure out a more streamlined process for cleaning one of the variables (called Q12). For this question, people were allowed to select more than one response, and all of their choices were combined into one variable (Q12), separated by commas. There were 14 choices that they could select. An example of what the data looks like is below:
ID Q12
1 1,2,3,4,5
2 4,11,13
3 6,7,8,9,10,11
I am trying to re-code Q12 into 14 variables corresponding to each of the question choices. For example, I would create a new variable called Q12_1 for all of the people who selected choice 1 and it would be coded 0 or 1 (0= did not select, 1 = did select), and so on and so forth. I was going to use the INDEX function to pull the choice # (1-14) in Q12, and I could just do this 14 times, but I wanted a more efficient way of doing it. Here's the code I could use:
DATA TEST;
SET CLEAN_DATASET;
if index(Q12, "1") then Q12_1 = 1;
else if Q12 = ' ' then Q12_1 = .;
else Q12_1 = 0;
RUN;
I could repeat this IF statement 14 times, swapping out the "1" for every choice #, but I want to learn how to do this using some kind of loop like an array to save myself time. However, I've been running into trouble because I can't figure out how to create a loop of the numbers 1-14 in the code below:
DATA TEST;
SET CLEAN_DATASET;
array X [14] Q12_1 Q12_2 Q12_3 Q12_4 Q12_5 Q12_6 Q12_7 Q12_8 Q12_9 Q12_10 Q12_11 Q12_12 Q12_13 Q12_14;
do i = 1 to 14;
if index(Q12, "i") then X[i] = 1;
else if Q12 = ' ' then X[i] = .;
else X[i] = 0;
end;
Drop i;
RUN;
I seem to be running into trouble with the INDEX function. When I ran this, my newly created variables only contained 0's and blanks, instead of 1's as well. I also tried adding another DO loop and a temporary array but that didn't work either. Any suggestions??
Thank you!
Ok. This gives you what you want 🙂
data have;
input ID Q12 $20.;
datalines;
1 1,2,3,4,5
2 4,11,13
3 6,7,8,9,10,11
4 .
5 1,9
6 .
;
data want(drop=i j idx);
set have(rename=(Q12=_Q12));
array Q{14} Q1-Q14 (14*0);
put _Q12;
do i=1 to countw(_Q12, ',');
idx=input(scan(_Q12, i, ','), 8.);
if idx in (1:14) then Q[idx]=1;
end;
if sum(of Q[*]) eq 0 then call missing(of Q[*]);
output;
do j=1 to dim(Q);
Q[j]=0;
end;
run;
Welcome to the SAS communities 🙂
There is probably a simpler way, but here is my approach
data have;
input ID Q12 $20.;
datalines;
1 1,2,3,4,5
2 4,11,13
3 6,7,8,9,10,11
;
data want(drop=i j idx);
set have(rename=(Q12=_Q12));
array Q{14} Q1-Q14 (14*0);
do i=1 to countw(_Q12, ',');
idx=input(scan(_Q12, i, ','), 8.);
Q[idx]=1;
end;
output;
do j=1 to dim(Q);
Q[j]=0;
end;
run;
Thank you very much for your help!
I tried your code, but I got this error message:
25 GOPTIONS ACCESSIBLE;
26 Data TEST(drop = i j idx);
27 Set CLEAN_DATASET(rename = (Q12 = _Q12));
28
29 Array Q[14] Q12_1-Q12_14 (14*0);
30
31 do i = 1 to countw(_Q12, ',');
32 idx = input(scan(_Q12, i, ','), 8.);
33 Q[idx] = 1;
34 end;
35
36 output;
37
38 do j = 1 to dim(Q);
39 Q[j] = 0;
40
41 end;
42
43 run;
ERROR: Array subscript out of range at line 33 column 1.
This would happen if your test answers contain values other than 1,2,3,4,5,6,7,8,9,10,11,12,13 and 14 ..
Which is not the case in your posted example data. But may be the case in your actual data. Am I correct?
A quick fix that takes care of invalid data entries could be like below.
Notice how I purposely added an invalid response number of 15 in the third observation, which is now handled. However, I would probably clean the data properly at an earlier state.
data have;
input ID Q12 $20.;
datalines;
1 1,2,3,4,5
2 4,11,13
3 6,7,8,9,10,11,15
;
data want(drop=i j idx);
set have(rename=(Q12=_Q12));
array Q{14} Q1-Q14 (14*0);
do i=1 to countw(_Q12, ',');
idx=input(scan(_Q12, i, ','), 8.);
if idx in (1:14) then Q[idx]=1;
end;
output;
do j=1 to dim(Q);
Q[j]=0;
end;
run;
Thank you, I will try this. The actual data contains missing values, which I am assuming is why there was an error? I want to keep the missing values as missing, and not code them as 0.
What does a missing value look like in this context? Two commas in a row? Please edit the data step that I created of your posted example data to shows us.
Also, how do you want to 'keep' the missing values here?
data have;
input ID Q12 $20.;
datalines;
1 1,2,3,4,5
2 4,11,13
3 6,7,8,9,10,11
;
.
Sorry, I should have been more clear. A missing value in my dataset is someone who didn't answer the question at all. I want to be able to distinguish between those who didn't answer the question, and people who did answer the question but may not have selected a certain choice (I would code the choices they didn't select as 0).
Here's an example:
data have;
input ID Q12 $20.;
datalines;1 1,2,3,4,5
2 4,11,13
3 6,7,8,9,10,11
4 .
5 1,9
6 .;
ID 4 and 6 is what a missing value would look like. If a person answered the question but didn't select a certain choice, the missing choices aren't listed (see ID 1,2,3, and 5).
I want to code things like this:
For ID 5, for example, I would code Q12_1 and Q12_9 =1 and all the other variables would be coded as 0 (since this person answered the question). But for ID 6, I would code all the variables = missing.
Well that got messed up...sorry let me try that again
data have;
input ID Q12 $20.;
datalines;
1 1,2,3,4,5
2 4,11,13
3 6,7,8,9,10,11
4 .
5 1,9
6 .
;
Ok. This gives you what you want 🙂
data have;
input ID Q12 $20.;
datalines;
1 1,2,3,4,5
2 4,11,13
3 6,7,8,9,10,11
4 .
5 1,9
6 .
;
data want(drop=i j idx);
set have(rename=(Q12=_Q12));
array Q{14} Q1-Q14 (14*0);
put _Q12;
do i=1 to countw(_Q12, ',');
idx=input(scan(_Q12, i, ','), 8.);
if idx in (1:14) then Q[idx]=1;
end;
if sum(of Q[*]) eq 0 then call missing(of Q[*]);
output;
do j=1 to dim(Q);
Q[j]=0;
end;
run;
This worked! Thank you!!
Anytime, glad to help 🙂
data have;
input ID Q12 $20.;
datalines;
1 1,2,3,4,5
2 4,11,13
3 6,7,8,9,10,11
4 .
5 1,9
6 .
;
data want;
set have;
array t Q12_1-Q12_14 (14*0);
retain k;
if _n_ = 1 then k = peekclong(addrlong (t[1]), 112);
else call pokelong(k, addrlong (t[1]), 112) ;
do i=1 to countw(q12);
j=scan(q12,i);
t(input(j,8.))=1;
end;
if 1 not in t then call missing(of t(*));
drop k j i;
run;
This also worked perfectly. Thank you for your help!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.