Hello- I am trying to count the number of negative and positive words within each "Comment" observation.
Variables: Comment: A list of a website comments
Negative: A list of negative associated words, each observation is one negative word
Positive: A list of positive associated words, each observation is one positive word
I have gathered the information and combined them into one data set, now I am confused how to scan/index each comment observation when I'm looking to see if it contains any word in an entire list and eventually count them to something like this:
Does anyone have any insight?
Comment Negative Positive Rating
"Wow so great" 2
"It's OK" 1
"Good but boring" 0
"Meh" -1
Bad
Meh
boring
Wow
Great
OK
Good
Conveniently this problem becomes another showcase of Hash() power:
data good;
input good$;
cards;
wow
great
ok
good
better
best
;
data bad;
input bad$;
cards;
bad
meh
boring
never
;
data have;
input comment $50.;
cards;
"Wow so great"
"It's OK"
"Good but boring"
"Meh"
"Good Good Good Better Best, Never let it rest"
;
data want;
if _n_=1 then do;
if 0 then set good bad;
declare hash g(dataset:'good', multidata:'y');
g.definekey('good');
g.definedone();
declare hash b(dataset:'bad', multidata:'y');
b.definekey('bad');
b.definedone();
end;
set have;
length s $ 50;
do i=1 by 1 until (missing(s));
s=lowcase(scan(comment,i,'", '));
do rc_good=g.find(key:s) by 0 while (rc_good=0);
gs+1;
rc_good=g.find_next(key:s);
end;
do rc_bad=b.find(key:s) by 0 while (rc_bad=0);
bs+1;
rc_bad=b.find_next(key:s);
end;
end;
Rating=sum(gs,-bs);
call missing(gs, bs);
keep comment rating;
run;
Haikuo
Although I agree completely with Haikuo that the hash object, and his program, are the best solution to this problem, it is also logically amenable to SQL:
data good;
input good$;
cards;
wow
great
ok
good
better
best
;
data bad;
input bad$;
cards;
bad
meh
boring
never
;
data have;
input comment $50.;
cards;
"Wow so great"
"It's OK"
"Good but boring"
"Meh"
"Good Good Good Better Best, Never let it rest"
;
data have_words(keep=comment CandidateWord);
set have;
length CandidateWord $ 50;
do i=1 by 1 until (missing(CandidateWord));
CandidateWord=lowcase(scan(comment,i,'", '));
if ~missing(CandidateWord) then
output;
end;
run;
proc sql;
create table want_details as
select comment, 1 as rating
from have_words h inner join good g on(h.CandidateWord = g.good)
outer union corresponding
select comment, -1 as rating
from have_words h inner join bad b on(h.CandidateWord = b.bad)
;
quit;
proc means noprint nonobs nway;
class comment;
var rating;
output out=want(drop=_type_ _freq_) sum=;
run;
Tom
Hi Tom,
You should change "her" in
"Although I agree completely with Haikuo that the hash object, and her program, are the best solution to this problem, it is also logically amenable to SQL:"
to "his". - Linlin
OOPS! Thanks, Linlin. It's lost in my excellent but short memory, but somehow I ended up with the impression that our esteemed colleague shares your gender, not mine.
Well, at least I said his program is the best solution...that's gotta get me SOME points!!
Best,
Tom
Thanks, LinLin, for putting the "Y" chromosome back to my online identity. Tom, not being a female really wasn't up to me, LOL. And being a fan since 3 years ago in SAS-L, I am honored for being mentioned by your post,
Haikuo
I'd like to see a performance comparison between the methods, including this one using an informat :
data positive;
input word$;
datalines;
wow
great
ok
good
better
best
;
data negative;
input word$;
datalines;
bad
meh
boring
never
;
data have;
length comment $100;
input comment &;
datalines;
Wow so great!
It's OK
Good but boring...
Meh
Good Good Good Better Best, Never let it rest.
;
data emotion;
length type hlo $1 label $4 fmtname $7;
set positive(in=pos) negative(in=neg);
fmtname = "Emotion";
type = "I"; hlo = "U";
start = upcase(word);
label = put(pos - neg, best4.);
run;
proc format cntlin=emotion; run;
data want(drop=pos len i);
set have;
emoScore = 0;
call scan(comment, 1, pos, len);
do i = 2 by 1 while (pos>0);
emoScore = sum(emoScore, input(substr(comment, pos, len), ?? emotion.));
call scan(comment, i, pos, len);
end;
run;
proc print noobs; run;
PG
That would be great fun. I always like to make these somewhat real world. @sucksatsas would you be in a position to share your lists of "good" and "bad" words, the number of comments that you want to use this on, and how many words are typically in your comments?
I can generate some data that'll pound all these solutions, and I'll post the results. Hopefully over the next week or two.
Tom
Thanks, Tom, looking forward to your verdict. After spending more time with it, I have made the Hash code more succinct (not sure about the efficient improvement though). At the same time, since Tom is doing the service for us, I 'd like to add Array() approach to the pool of candidates .
data good;
input good$;
cards;
wow
great
ok
good
better
best
;
data bad;
input bad$;
cards;
bad
meh
boring
never
;
data have;
input comment $50.;
cards;
"Wow so great"
"It's OK"
"Good but boring"
"Meh"
"Good Good Good Better Best, Never let it rest"
;
/*Hash()*/
data want_hash;
if _n_=1 then do;
if 0 then set good bad;
declare hash g(dataset:'good', multidata:'y');
g.definekey('good');
g.definedone();
declare hash b(dataset:'bad', multidata:'y');
b.definekey('bad');
b.definedone();
end;
set have;
length s $ 50;
do i=1 by 1 until (missing(s));
s=lowcase(scan(comment,i,'", '));
gs+ (not g.find(key:s));
bs+ (not b.find(key:s));
end;
Rating=sum(gs,-bs);
call missing(gs, bs);
keep comment rating;
run;
/* Array()*/
proc sql noprint;
select nobs into :bad_obs from dictionary.tables where LIBNAME='WORK' AND MEMNAME='BAD';
SELECT NOBS INto :GOOD_OBS FROM DICTIONARY.TABLES WHERE LIBNAME='WORK' AND MEMNAME='GOOD';
QUIT;
data want_array;
if _n_=1 then do;
do i=1 to bobs;
set bad nobs=bobs point=i;
array b(&bad_obs) $10. _temporary_;
b(i)=bad;
end;
do i=1 to gobs;
set good nobs=gobs point=i;
array g(&good_obs) $10. _temporary_;
g(i)=good;
end;
end;
set have;
length s $ 50;
do i=1 by 1 until (missing(s));
s=lowcase(scan(comment,i,'", '));
gs+ (s in g);
bs+ (s in b);
end;
Rating=sum(gs,-bs);
call missing(gs, bs);
keep comment rating;
run;
Haikuo
The more the merrier. A proposal with hash and prx :
data positive;
length word $12;
input word;
datalines;
wow
great
ok
good
better
best
;
data negative;
length word $12;
input word;
datalines;
bad
meh
boring
never
;
data have;
length comment $100;
input comment &;
datalines;
Wow so great!
It's OK
Good but boring...
Meh
Good Good Good Better Best, Never let it rest.
;
/* Note: datasets positive and negative could also include a word specific value or
there could be a single table with positive and negative values */
data emotions;
set positive(in=pos) negative(in=neg);
word = upcase(word);
value = pos - neg;
run;
data want;
length word $12;
if prxId = 0 then do;
declare hash w(dataset:"emotions");
w.definekey("word");
w.definedata("value");
w.definedone();
call missing(word, value);
prxId + prxparse("/\w+/");
end;
set have;
start = 1;
score = 0;
call prxnext(prxId, start, -1, comment, pos, len);
do while (pos > 0);
word = upcase(substr(comment, pos, len));
if 0 = w.find() then score + value;
call prxnext(prxId, start, -1, comment, pos, len);
end;
drop prxId word value start pos len;
run;
proc print noobs; run;
PG
Hello again, after a long absence.
Sorry, everybody, as often happens life interfered with my plans for this. But I think I'm back on track. I have a test environment set up, and have run some preliminary tests. Although SAS is too fast to make the results meaningul, preliminary times are (in order of the programs that were submitted):
Hai.kuo #1
real time 4.52 seconds
user cpu time 2.69 seconds
system cpu time 0.09 seconds
Tom Kari
real time 83.41 seconds
user cpu time 6.44 seconds
system cpu time 7.27 seconds
PGStats #1
real time 2.06 seconds
user cpu time 2.01 seconds
system cpu time 0.04 seconds
Hai.kuo #2
real time 2.54 seconds
user cpu time 2.44 seconds
system cpu time 0.06 seconds
Hai.kuo #3
real time 6.48 seconds
user cpu time 6.25 seconds
system cpu time 0.03 seconds
PGStats #2
real time 3.06 seconds
user cpu time 2.93 seconds
system cpu time 0.07 seconds
I'll be ramping up the data volumes to get some meaningful times, and will report back on both times and memory usage (which should be significant for the array and hash approaches).
Talk to you soon!
Tom
PG is the winnor! Congratulations!!!
Not so fast!
1. I just finished some minor tweaking to the programs to correct some mistakes I'd made in getting them to all run off of the same data, but I don't think there will be any serious changes, but;
2. We still need to see how they perform at a volume that takes a minute or two.
So far, the only thing we're sure of is that my code is the slowest!:smileyblush:
Tom
Tom,
we change the rule to:
The slower the better.
it is very interesting question, Are you doing Context Analysis ?
data good; input good $upcase8. ; cards; Wow Great OK Good ; data bad; input bad $upcase8. ; cards; Bad Meh boring ; data have; input comment $50.; cards; "Wow so great" "It's OK" "Good but boring" "Meh" "Good Good Good Better Best, Never let it rest" ; run; data want(keep=comment score); if _n_=1 then do; if 0 then set good ; if 0 then set bad; declare hash g(dataset:'good'); g.definekey('good'); g.definedone(); declare hash b(dataset:'bad'); b.definekey('bad'); b.definedone(); end; set have; score=0; do i=1 to countw(comment, ,'ka'); temp=upcase(scan(comment,i, ,'ka')); put temp= ; good=temp; if g.find()=0 then score+1; bad=temp; if b.find()=0 then score+(-1); end; run;
Ksharp
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.