Hello,
I am working on a survey dataset and I come across this problem to count "Which is the last question a person stopped answer?"
My input data looks like:
- the 1st row, outdoor, indoor is NOT part of the dataset, just put in here to illustrate what I am hoping to do:
- Each row is a person, each column is a question (there were 300+ questions)
- And missing values are different for each question, but each group do have same pattern (e.g., for outdoor, 9 is missing, for indoor, 99 is missing, something like this)
Outdoor | Indoor | |||||||
Person_ID | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
1 | 1 | 9 | 2 | 1 | 2 | 1 | 99 | 99 |
2 | 2 | 3 | 2 | 2 | 3 | 5 | 99 | 8 |
3 | 1 | 3 | 2 | 1 | 2 | 6 | 2 | 7 |
4 | 2 | 5 | 8 | 2 | 5 | 99 | 5 | 99 |
5 | 1 | 6 | 5 | 1 | 6 | 7 | 6 | 7 |
6 | 9 | 1 | 3 | 9 | 6 | 99 | 4 | 3 |
7 | 9 | 2 | 9 | 9 | 6 | 1 | 5 | 1 |
8 | 1 | 1 | 1 | 1 | 6 | 5 | 6 | 5 |
9 | 2 | 2 | 2 | 2 | 6 | 6 | 6 | 6 |
10 | 1 | 1 | 1 | 1 | 99 | 99 | 99 | 8 |
11 | 2 | 9 | 2 | 2 | 99 | 7 | 99 | 99 |
12 | 1 | 9 | 1 | 1 | 99 | 99 | 99 | 99 |
13 | 9 | 9 | 9 | 9 | 3 | 1 | 3 | 1 |
14 | 9 | 9 | 9 | 9 | 3 | 3 | 3 | 3 |
- My research question is "Which is the last question a person stops answer?", E.g.,
I want my output data to look like (a frequency table)
- N is # of people who stopped at a question that falls into this category
N | % | |
Outdoor | ||
indoor | ||
Food | ||
drink | ||
… |
The challenge I am facing are:
- How to count 'Stopped answering", e.g., in the example input data, person 1 have missing data on Q2, Q7, Q8, but Q8 is where he stopped, so each person only has 1 question
something like:
Person_ID | STOP | STOP_GROUP |
1 | Q7 | indoor |
2 | Q7 | indoor |
3 | Q8 | indoor |
4 |
- There are 300+ columns and 20+ 'groups', and the column/variable names are not as simple as Q1 - Q300+, they each have wired name (XHV, DWE, DIO....), I made them as Q1 - Q30 just for demonstration
Any idea what is the best approach for creating the about 'ideal output' frequency table?
Any help will be greatly appreciated!
Thank you so much!
Yijun
Are you sure Person2 stopped answering at Q7?
2 | 2 | 3 | 2 | 2 | 3 | 5 | 99 | 8
|
Your table suggests he/she has answered Q8???????
Thanks @novinosrin for your question.
I just thought about it and you are right, in this fake data I put together, yes, Person2 didn't stop answering at Q7.
Really good point, which brings more questions about how to best account for ' at which question a person stops answering'. I translate it into 'which is the last question that has a missing value', but now that I think, it should be 'which is the last question that has a missing value AND all questions after that don't have valid value'
Really appreciate your help!
You wrote -'which is the last question that has a missing value AND all questions after that don't have valid value'
If the above holds true, then it's a piece of cake
data have;
input Person_ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8;
cards;
1 1 9 2 1 2 1 99 99
2 2 3 2 2 3 5 99 8
3 1 3 2 1 2 6 2 7
4 2 5 8 2 5 99 5 99
5 1 6 5 1 6 7 6 7
6 9 1 3 9 6 99 4 3
7 9 2 9 9 6 1 5 1
8 1 1 1 1 6 5 6 5
9 2 2 2 2 6 6 6 6
10 1 1 1 1 99 99 99 8
11 2 9 2 2 99 7 99 99
12 1 9 1 1 99 99 99 99
13 9 9 9 9 3 1 3 1
14 9 9 9 9 3 3 3 3
;
data want;
set have;
array t(*) Q:;
do n=dim(t) to 1 by -1;
if (t(n) in (99,9)) then do;
stop=vname(t(n));
return;
end;
end;
run;
And to address -
- There are 300+ columns and 20+ 'groups', and the column/variable names are not as simple as Q1 - Q300+, they each have wired name (XHV, DWE, DIO....), I made them as Q1 - Q30 just for demonstration
Use Variable lists with a double --
XHV, DWE, DIO
can be written as array t(*) XHV-- DIO; *xhv your first variable and dio is last variable in the example list;
I have dealt with this exact question previously though I did not have SAS at the time and had to use less flexible tools.
There are some questions that answers to would help.
1) Are all of the "question" answers numeric or all character or do you have numeric character responses intermixed?
2) Are all of the question variables actually named Qxx where xx is some numeric value AND the value is at least in order as asked in the question instrument. (I ask as I have seen questionnaires that change over time and then next thing you know Q121 is the insert after Q12).
If all of the question variables are of the same type, numeric or character then this should work (as long as you don't already have a variable named QUEST to conflict with the array name)
data want; set have; array quest q: ; /* alternatively: array quest q1 -- lastquestionvariablname;*/ do i= dim(quest) to 1 by (-1); if not missing(Quest[i]) then do; LastQuest = vname(quest[i]); leave; end; end; run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.