Dear community,
I have a survey questionnaire, one of the questions is to ask people to pick their most favourite fruit(s) from the list below (select all that apply)
apple |
pear |
orange |
grape |
watermelon |
banana |
The data ended like below:
pid | fruit1 | fruit2 | fruit3 | fruit4 | fruit5 | fruit6 |
1 | apple |
|
|
|
|
|
2 |
| pear |
|
|
|
|
3 |
|
| orange |
|
|
|
4 |
|
| orange |
|
|
|
5 |
|
|
| grape |
|
|
6 |
|
|
|
| watermelon |
|
7 |
| pear |
|
|
| banana |
8 | apple |
|
| grape |
|
|
9 |
|
| orange |
|
|
|
10 | apple |
|
|
|
| banana |
11 |
|
|
|
| watermelon |
|
12 |
|
| orange |
|
|
|
13 |
|
| orange |
|
|
|
14 | apple |
|
| grape |
|
|
15 |
|
|
|
| watermelon |
|
16 | apple |
|
|
|
| banana |
17 |
|
|
| grape |
|
|
18 | apple |
|
|
|
|
|
19 |
|
|
|
|
|
|
20 | apple |
|
|
|
|
|
Since this is one question in the dataset and I would think it should be presented as one variable (e.g., fruit), a few observations do have multiple selections though.
My question is how I should consolidate the data into one column (e.g., fruit)
Thanks
Personally I would be much more tempted to create normalized values such as "apple" with a 1/0 coding for selected/not selected.
A SUM of the Apple variable would indicate the number of respondents that selected apple, mean would the percentage of respondent that selected apples. You could also reduce that to a fake binary character value that would contain all of the information.
data want; set have; array fq fruit1-fruit6; array fv apple pear orange grape watermelon banana; /* set all the "standarized" variables to 0. 0 means not selected */ do i= 1 to dim(fv); fv[i]=0; end; do i= 1 to dim(fq); index= whichc(fq[i],'apple', 'pear', 'orange', 'grape', 'watermelon', 'banana'); if index>0 then fv[index]=1; end; fruitlist= catt(of fv(*)); run;
A 1 in the first position of fruitlist means that apple was selected, a 0 apple was not selected. A consistent form, has all of the data.
You could even create a custom format to indicate exactly which combination of selected values exists.
The answer depends on how you want to handle people who have two or more fruits chosen. Please tell us what you want to do in that situation.
I want to capture both, like 'apple and pear' under the fruit variable, if both 'apple' and 'pear' have been selected.
or duplicate the row with one observation as 'apple' and one observation as 'pear', if there is more than one choice has been selected fro the same person. I also want to know if there is any other way to consolidate the data
Thank you!
if it's survey data, I would keep it like
data want;
set have;
array t(*) fruit:;
selected_fruit=coalescec(of t(*));
run;
if it is more than one choice, I'm afraid you need to transpose. Either way, a long data set is always better than wide i think
Hello,
data have;
infile cards dlm=',' dsd;
length fruit1-fruit6 $10;
input id fruit1-fruit6 ;
cards;
1 , apple , , , , ,
2 , , pear , , , ,
3 , , , orange , , ,
4 , , , orange , , ,
5 , , , , grape , ,
6 , , , , , watermelon ,
7 , , pear , , , , banana
8 , apple , , , grape , ,
9 , , , orange , , ,
10 , apple , , , , , banana
11 , , , , , watermelon ,
12 , , , orange , , ,
13 , , , orange , , ,
14 , apple , , , grape , ,
15 , , , , , watermelon ,
16 , apple , , , , , banana
17 , , , , grape , ,
18 , apple , , , , ,
19 , , , , , ,
20 , apple , , , , ,
run;
data want;
set have;
keep id fruit order;
array fruits(*) fruit:;
order=1;
do i=1 to dim(fruits);
if not missing(fruits(i)) then do;
fruit=fruits(i);
output;
order=order+1;
end;
end;
run;
Yes, I saw that note, multiple choices would mean multiple records per id by virtue of transpose
data fruit;
infile cards truncover;
input pid (fruit1 fruit2 fruit3 fruit4 fruit5 fruit6) ($);
cards;
1 apple orange
2 . pear
;
data want;
set fruit;
array t fruit:;
do _i_=whichc(coalescec(of t(*)),of t(*)) to dim(t);
if t(_i_)>' ' then do; seleted_fruit=t(_i_);output;end;
end;
keep pid seleted_fruit;
run;
or
proc transpose data=fruit out=f(where=(col1 is not missing));
by pid;
var fruit:;
run;
Personally I would be much more tempted to create normalized values such as "apple" with a 1/0 coding for selected/not selected.
A SUM of the Apple variable would indicate the number of respondents that selected apple, mean would the percentage of respondent that selected apples. You could also reduce that to a fake binary character value that would contain all of the information.
data want; set have; array fq fruit1-fruit6; array fv apple pear orange grape watermelon banana; /* set all the "standarized" variables to 0. 0 means not selected */ do i= 1 to dim(fv); fv[i]=0; end; do i= 1 to dim(fq); index= whichc(fq[i],'apple', 'pear', 'orange', 'grape', 'watermelon', 'banana'); if index>0 then fv[index]=1; end; fruitlist= catt(of fv(*)); run;
A 1 in the first position of fruitlist means that apple was selected, a 0 apple was not selected. A consistent form, has all of the data.
You could even create a custom format to indicate exactly which combination of selected values exists.
Hi All,
I can't appreciated more for the help and support I got from this forum.
Your support really helped me to build up my confidence to continue the journey in learning SAS.
Thank you all !!!
Data Want;
Set Have;
FavFruit = Catx(' and ', of fruit1-fruit6);
Keep Id FavFruit;
Run;
Does this accomplish what you were trying to accomplish?
the code works perfect!
Big THANKS to you!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.