Not applicable
Posts: 1

# Scanning an Observation for a Word within a Variable

Hello- I am trying to count the number of negative and positive words within each "Comment" observation.

Variables: Comment: A list of a website comments

Negative: A list of negative associated words, each observation is one negative word

Positive: A list of positive associated words, each observation is one positive word

I have gathered the information and combined them into one data set, now I am confused how to scan/index each comment observation when I'm looking to see if it contains any word in an entire list and eventually count them to something like this:

Does anyone have any insight?

Comment         Negative    Positive     Rating

"Wow so great"                                        2

"It's OK"                                                  1

"Good but boring"                                     0

"Meh"                                                     -1

Meh

boring

Wow

Great

OK

Good

Posts: 3,167

## Re: Scanning an Observation for a Word within a Variable

Conveniently this problem becomes another showcase of Hash() power:

data good;

input good\$;

cards;

wow

great

ok

good

better

best

;

cards;

meh

boring

never

;

data have;

input comment \$50.;

cards;

"Wow so great"

"It's OK"

"Good but boring"

"Meh"

"Good Good Good Better Best, Never let it rest"

;

data want;

if _n_=1 then do;

if 0 then set good bad;

declare hash g(dataset:'good', multidata:'y');

g.definekey('good');

g.definedone();

b.definedone();

end;

set have;

length s \$ 50;

do i=1 by 1 until (missing(s));

s=lowcase(scan(comment,i,'", '));

do rc_good=g.find(key:s) by 0 while (rc_good=0);

gs+1;

rc_good=g.find_next(key:s);

end;

bs+1;

end;

end;

Rating=sum(gs,-bs);

call missing(gs, bs);

keep comment rating;

run;

Haikuo

PROC Star
Posts: 1,316

## Re: Scanning an Observation for a Word within a Variable

Although I agree completely with Haikuo that the hash object, and his program, are the best solution to this problem, it is also logically amenable to SQL:

data good;
input good\$;
cards;
wow
great
ok
good
better
best
;

cards;
meh
boring
never
;

data have;
input comment \$50.;
cards;
"Wow so great"
"It's OK"
"Good but boring"
"Meh"
"Good Good Good Better Best, Never let it rest"
;

data have_words(keep=comment CandidateWord);
set have;
length CandidateWord \$ 50;

do i=1 by 1 until (missing(CandidateWord));
CandidateWord=lowcase(scan(comment,i,'", '));

if ~missing(CandidateWord) then
output;
end;
run;

proc sql;
create table want_details as
select comment, 1 as rating
from have_words h inner join good g on(h.CandidateWord = g.good)
outer union corresponding
select comment, -1 as rating
;
quit;

proc means noprint nonobs nway;
class comment;
var rating;
output out=want(drop=_type_ _freq_) sum=;
run;

Tom

Super Contributor
Posts: 1,636

## Re: Scanning an Observation for a Word within a Variable

Hi Tom,

You should change "her" in

"Although I agree completely with Haikuo that the hash object, and her program, are the best solution to this problem, it is also logically amenable to SQL:"

to "his".  - Linlin

PROC Star
Posts: 1,316

## Re: Scanning an Observation for a Word within a Variable

OOPS! Thanks, Linlin. It's lost in my excellent but short memory, but somehow I ended up with the impression that our esteemed colleague shares your gender, not mine.

Well, at least I said his program is the best solution...that's gotta get me SOME points!!

Best,

Tom

Posts: 3,167

## Re: Scanning an Observation for a Word within a Variable

Thanks, LinLin, for putting  the "Y" chromosome back to my online identity. Tom,  not being a female really wasn't up to me, LOL. And being a fan since 3 years ago in SAS-L, I am honored for being mentioned by your post,

Haikuo

Posts: 5,539

## Re: Scanning an Observation for a Word within a Variable

I'd like to see a performance comparison between the methods, including this one using an informat :

data positive;
input word\$;
datalines;
wow
great
ok
good
better
best
;

data negative;
input word\$;
datalines;
meh
boring
never
;

data have;
length comment \$100;
input comment &;
datalines;
Wow so great!
It's OK
Good but boring...
Meh
Good Good Good Better Best, Never let it rest.
;

data emotion;
length type hlo \$1 label \$4 fmtname \$7;
set positive(in=pos) negative(in=neg);
fmtname = "Emotion";
type = "I"; hlo = "U";
start = upcase(word);
label = put(pos - neg, best4.);
run;

proc format cntlin=emotion; run;

data want(drop=pos len i);
set have;
emoScore = 0;
call scan(comment, 1, pos, len);
do i = 2 by 1 while (pos>0);
emoScore = sum(emoScore, input(substr(comment, pos, len), ?? emotion.));
call scan(comment, i, pos, len);
end;
run;

proc print noobs; run;

PG

PG
PROC Star
Posts: 1,316

## Re: Scanning an Observation for a Word within a Variable

That would be great fun. I always like to make these somewhat real world. @sucksatsas would you be in a position to share your lists of "good" and "bad" words, the number of comments that you want to use this on, and how many words are typically in your comments?

I can generate some data that'll pound all these solutions, and I'll post the results. Hopefully over the next week or two.

Tom

Posts: 3,167

## Re: Scanning an Observation for a Word within a Variable

Thanks, Tom, looking forward to your verdict. After spending more time with it, I have made the Hash code more succinct (not sure about the efficient improvement though). At the same time, since Tom is doing the service for us, I 'd like to add Array() approach to the pool of candidates .

data good;

input good\$;

cards;

wow

great

ok

good

better

best

;

cards;

meh

boring

never

;

data have;

input comment \$50.;

cards;

"Wow so great"

"It's OK"

"Good but boring"

"Meh"

"Good Good Good Better Best, Never let it rest"

;

/*Hash()*/

data want_hash;

if _n_=1 then do;

if 0 then set good bad;

declare hash g(dataset:'good', multidata:'y');

g.definekey('good');

g.definedone();

b.definedone();

end;

set have;

length s \$ 50;

do i=1 by 1 until (missing(s));

s=lowcase(scan(comment,i,'", '));

gs+ (not g.find(key:s));

bs+ (not b.find(key:s));

end;

Rating=sum(gs,-bs);

call missing(gs, bs);

keep comment rating;

run;

/* Array()*/

proc sql noprint;

SELECT NOBS INto :GOOD_OBS FROM DICTIONARY.TABLES WHERE LIBNAME='WORK' AND MEMNAME='GOOD';

QUIT;

data want_array;

if _n_=1 then do;

do i=1 to bobs;

end;

do i=1 to gobs;

set good nobs=gobs point=i;

array g(&good_obs) \$10. _temporary_;

g(i)=good;

end;

end;

set have;

length s \$ 50;

do i=1 by 1 until (missing(s));

s=lowcase(scan(comment,i,'", '));

gs+ (s in g);

bs+ (s in b);

end;

Rating=sum(gs,-bs);

call missing(gs, bs);

keep comment rating;

run;

Haikuo

Posts: 5,539

## Re: Scanning an Observation for a Word within a Variable

The more the merrier. A proposal with hash and prx :

data positive;
length word \$12;
input word;
datalines;
wow
great
ok
good
better
best
;

data negative;
length word \$12;
input word;
datalines;
meh
boring
never
;

data have;
length comment \$100;
input comment &;
datalines;
Wow so great!
It's OK
Good but boring...
Meh
Good Good Good Better Best, Never let it rest.
;

/* Note: datasets positive and negative could also include a word specific value or

there could be a single table with positive and negative values */

data emotions;
set positive(in=pos) negative(in=neg);
word = upcase(word);
value = pos - neg;
run;

data want;
length word \$12;
if prxId = 0 then do;
declare hash w(dataset:"emotions");
w.definekey("word");
w.definedata("value");
w.definedone();
call missing(word, value);
prxId + prxparse("/\w+/");
end;

set have;
start = 1;
score = 0;
call prxnext(prxId, start, -1, comment, pos, len);
do while (pos > 0);
word = upcase(substr(comment, pos, len));
if 0 = w.find() then score + value;
call prxnext(prxId, start, -1, comment, pos, len);
end;
drop prxId word value start pos len;
run;

proc print noobs; run;

PG

PG
PROC Star
Posts: 1,316

## Re: Scanning an Observation for a Word within a Variable

Hello again, after a long absence.

Sorry, everybody, as often happens life interfered with my plans for this. But I think I'm back on track. I have a test environment set up, and have run some preliminary tests. Although SAS is too fast to make the results meaningul, preliminary times are (in order of the programs that were submitted):

Hai.kuo #1

real time           4.52 seconds
user cpu time       2.69 seconds
system cpu time     0.09 seconds

Tom Kari

real time          83.41 seconds
user cpu time       6.44 seconds
system cpu time     7.27 seconds

PGStats #1

real time           2.06 seconds
user cpu time       2.01 seconds
system cpu time     0.04 seconds

Hai.kuo #2

real time           2.54 seconds
user cpu time       2.44 seconds
system cpu time     0.06 seconds

Hai.kuo #3

real time           6.48 seconds
user cpu time       6.25 seconds
system cpu time     0.03 seconds

PGStats #2

real time           3.06 seconds
user cpu time       2.93 seconds
system cpu time     0.07 seconds

I'll be ramping up the data volumes to get some meaningful times, and will report back on both times and memory usage (which should be significant for the array and hash approaches).

Talk to you soon!
Tom

Super Contributor
Posts: 1,636

## Re: Scanning an Observation for a Word within a Variable

PG is the winnor! Congratulations!!!

PROC Star
Posts: 1,316

## Re: Scanning an Observation for a Word within a Variable

Not so fast!

1. I just finished some minor tweaking to the programs to correct some mistakes I'd made in getting them to all run off of the same data, but I don't think there will be any serious changes, but;

2. We still need to see how they perform at a volume that takes a minute or two.

So far, the only thing we're sure of is that my code is the slowest!:smileyblush:

Tom

Super Contributor
Posts: 1,636

## Re: Scanning an Observation for a Word within a Variable

Tom,

we change the rule to:

The slower the better.

Super User
Posts: 10,784

## Re: Scanning an Observation for a Word within a Variable

it is very interesting question, Are you doing Context Analysis ?

```data good;
input good \$upcase8. ;
cards;
Wow
Great
OK
Good
;
cards;
Meh
boring
;
data have;
input comment \$50.;
cards;
"Wow so great"
"It's OK"
"Good but boring"
"Meh"
"Good Good Good Better Best, Never let it rest"
;
run;
data want(keep=comment score);
if _n_=1 then do;
if 0 then set good ;
declare hash g(dataset:'good');
g.definekey('good');
g.definedone();
b.definedone();
end;
set have;
score=0;
do i=1 to countw(comment, ,'ka');
temp=upcase(scan(comment,i, ,'ka')); put temp= ;
good=temp; if g.find()=0 then score+1;