07-13-2015 12:41 PM
In our research project we have other fields for several questions where participants can write in an answer if it is not listed as an option. As you can imagine we get a lot of responses that are spelled wrong or the same with different wording. I know SAS can do fuzzy matching to combine data sets, can this be done within a character variable. For instance, if the responses are:
"Back pain" "Back problems" "back aches"
they would all be considered the same. Additionally, some participants will list several things, such as:
"back pain and hip pain" "back and neck pain"
I would want these to be included in the back pain category as well. Is there a way to do this without manually going through everything? There are about 20,000 responses so I am trying to streamline the process.
Thank you in advance for your help.
07-13-2015 01:02 PM
I have little expertise in questionnaire analysis but I would be interested to see proposed solutions evaluated against the human solution: hiring someone to interpret those unlisted answers.
07-13-2015 01:58 PM
Are we talking 20k open-ended responses to a single question or 20k responses with some open-ends or 20k open-ends between many different questions?
What do you want to do with this?
My first inclination is to see how many of the open-ends should be recoded to existing categories of responses.
Since I've never worked somewhere with time/money/desire to try to build expert systems we pretty much relied on the Mark I human eye.
First, make the open-ends all lower or upper case and run proc freq on them to see how many overall and clusters (such as start with "back" or "pain").
Next would be to use make lists by respondent and response so that we could easily add the code for coding back to code values. With care that can be easily made into an update data set or one per question typically.
After that step you have a rough idea how many times some types of topics you may be considering for coding to a new code value.
Set a threshold of a minimum number of related responses to justify a new code value. This threshold depends on any types of analysis or summaries planned. With that threshold now look at (remaining after recoding above) the responses again and see if there are enough to meet that threshold. If so recode the responses to the new code(s). If not leave alone or possibly correct typos.
I've done this for survey projects approaching 57,000 responses with one question approaching a 20% open-end response (after years of adding categories).
07-13-2015 02:14 PM
For this it is just one question with about 20,000 open ended responses since it is a select all that apply and includes a open text field for additional responses. In the past I've generally relied on doing a quick scan of the responses to see which are most frequent to add new categories to our questionnaires. You mention the reasons I would like to do this, the first is to fill in responses of other categories that were listed if a participant missed that it said something like "asthma" in the list and wrote it in instead. With so many other responses and misspellings this would take a tremendous effort and unfortunately we don't have the time/resources either. The other reason is to determine which other responses have the highest frequencies to prioritize new response options, though, as you mention, this can be done by just looking through the responses.
I will try what you suggested to create an update dataset. Thank you!
07-13-2015 02:22 PM
With a data step you can make a workable update dataset with some like:
set have (keep = <list of variables need to uniquely id respondent> q1openend);
where not missing(q1openend);
q1 = .;
/* if the original Q1 is text then use q1=''; so types match later. If the codes are longer than 8 you'll need a Length statement*/
q1 is a generic place holder for your variable name.
With viewtable in Base SAS you can look at the data and enter the appropriate code. Then use your preferred method for update.