I want to split a text variable into different new variables but only keeping words that I want as follow:
Original dataset:
review_ID review
0001 I am satisfied with the staff and also the restaurant
0002 The swimming pool, spa and bar are amazing
...
Keywords that I want to keep:
staff
restaurant
swimming pool
spa
bar
Dataset that I want:
review_ID review1 review2 review3
0001 staff restaurant -
0002 swimming pool spa bar
...
Can anyone help me out please?
Hello,
For further processing and since you don't know in advance how many keywords you will find for each ID, it is better to use a vertical data storage.
data have;
infile cards dlm='+' dsd;
format review_ID z4. review $80.;
input review_ID review;
cards;
0001+I am satisfied with the staff and also the restaurant
0002+The swimming pool, spa and bar are amazing
;
run;
data keywords;
length keyword $20.;
input keyword;
cards;
staff
restaurant
swimming pool
spa
bar
;
run;
proc sql noprint;
SELECT count(keyword) INTO :nk
FROM keywords;
quit;
data want (rename=(_review=review));
set have;
length _review $20.;
array kw(&nk.) $20.;
retain kw:;
if _N_=1 then do i=1 to dim(kw);
set keywords;
kw(i)=keyword;
end;
do i=1 to dim(kw);
if findw(review,strip(kw(i))) then do;
_review=kw(i);
output;
end;
end;
if _review=" " then output;
keep review_ID _review;
run;
data _NULL_;
set want;
by review_ID;
if first.review_ID then put review_ID@;
if last.review_ID then put review;
else put review@;
run;
Hello,
For further processing and since you don't know in advance how many keywords you will find for each ID, it is better to use a vertical data storage.
data have;
infile cards dlm='+' dsd;
format review_ID z4. review $80.;
input review_ID review;
cards;
0001+I am satisfied with the staff and also the restaurant
0002+The swimming pool, spa and bar are amazing
;
run;
data keywords;
length keyword $20.;
input keyword;
cards;
staff
restaurant
swimming pool
spa
bar
;
run;
proc sql noprint;
SELECT count(keyword) INTO :nk
FROM keywords;
quit;
data want (rename=(_review=review));
set have;
length _review $20.;
array kw(&nk.) $20.;
retain kw:;
if _N_=1 then do i=1 to dim(kw);
set keywords;
kw(i)=keyword;
end;
do i=1 to dim(kw);
if findw(review,strip(kw(i))) then do;
_review=kw(i);
output;
end;
end;
if _review=" " then output;
keep review_ID _review;
run;
data _NULL_;
set want;
by review_ID;
if first.review_ID then put review_ID@;
if last.review_ID then put review;
else put review@;
run;
data have;
infile cards dlm='+' dsd;
format review_ID z4. review $80.;
input review_ID review;
cards;
0001+I am satisfied with the staff and also the restaurant
0002+The swimming pool, spa and bar are amazing
;
run;
data keywords;
input keyword $20.;
cards;
staff
restaurant
swimming pool
spa
bar
;
run;
proc sql;
create table temp as
select review_id,keyword
from have as a,keywords as b
where find(review,strip(keyword),'i')
order by review_id;
quit;
proc transpose data=temp out=want prefix=review;
by review_id;
var keyword;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.