Re: Scanning an Observation for a Word within a Variable

sucksatsas · Posted 05-31-2013 07:22 PM

Hello- I am trying to count the number of negative and positive words within each "Comment" observation.

Variables: Comment: A list of a website comments

Negative: A list of negative associated words, each observation is one negative word

Positive: A list of positive associated words, each observation is one positive word

I have gathered the information and combined them into one data set, now I am confused how to scan/index each comment observation when I'm looking to see if it contains any word in an entire list and eventually count them to something like this:

Does anyone have any insight?

Comment Negative Positive Rating

"Wow so great" 2

"It's OK" 1

"Good but boring" 0

"Meh" -1

Bad

Meh

boring

Wow

Great

OK

Good

Haikuo · Posted 05-31-2013 08:35 PM

Conveniently this problem becomes another showcase of Hash() power:

data good;

input good$;

cards;

wow

great

ok

good

better

best

;

data bad;

input bad$;

cards;

bad

meh

boring

never

;

data have;

input comment $50.;

cards;

"Wow so great"

"It's OK"

"Good but boring"

"Meh"

"Good Good Good Better Best, Never let it rest"

;

data want;

if _n_=1 then do;

if 0 then set good bad;

declare hash g(dataset:'good', multidata:'y');

g.definekey('good');

g.definedone();

declare hash b(dataset:'bad', multidata:'y');

b.definekey('bad');

b.definedone();

end;

set have;

length s $ 50;

do i=1 by 1 until (missing(s));

s=lowcase(scan(comment,i,'", '));

do rc_good=g.find(key:s) by 0 while (rc_good=0);

gs+1;

rc_good=g.find_next(key:s);

end;

do rc_bad=b.find(key:s) by 0 while (rc_bad=0);

bs+1;

rc_bad=b.find_next(key:s);

end;

Rating=sum(gs,-bs);

call missing(gs, bs);

keep comment rating;

run;

Haikuo

TomKari · Posted 06-01-2013 03:43 PM

Although I agree completely with Haikuo that the hash object, and his program, are the best solution to this problem, it is also logically amenable to SQL:

data good;
input good$;
cards;
wow
great
ok
good
better
best
;

data bad;
input bad$;
cards;
bad
meh
boring
never
;

data have;
input comment $50.;
cards;
"Wow so great"
"It's OK"
"Good but boring"
"Meh"
"Good Good Good Better Best, Never let it rest"
;

data have_words(keep=comment CandidateWord);
set have;
length CandidateWord $ 50;

do i=1 by 1 until (missing(CandidateWord));
CandidateWord=lowcase(scan(comment,i,'", '));

if ~missing(CandidateWord) then
output;
end;
run;

proc sql;
create table want_details as
select comment, 1 as rating
   from have_words h inner join good g on(h.CandidateWord = g.good)
    outer union corresponding
     select comment, -1 as rating
      from have_words h inner join bad b on(h.CandidateWord = b.bad)
;
quit;

proc means noprint nonobs nway;
class comment;
var rating;
output out=want(drop=_type_ _freq_) sum=;
run;

Tom

Linlin · Posted 06-01-2013 09:29 PM

Hi Tom,

You should change "her" in

"Although I agree completely with Haikuo that the hash object, and her program, are the best solution to this problem, it is also logically amenable to SQL:"

to "his". - Linlin

TomKari · Posted 06-01-2013 11:04 PM

OOPS! Thanks, Linlin. It's lost in my excellent but short memory, but somehow I ended up with the impression that our esteemed colleague shares your gender, not mine.

Well, at least I said his program is the best solution...that's gotta get me SOME points!!

Best,

Tom

Haikuo · Posted 06-02-2013 09:59 AM

Thanks, LinLin, for putting the "Y" chromosome back to my online identity. Tom, not being a female really wasn't up to me, LOL. And being a fan since 3 years ago in SAS-L, I am honored for being mentioned by your post,

Haikuo

PGStats · Posted 06-02-2013 01:47 PM

I'd like to see a performance comparison between the methods, including this one using an informat :

data positive;
input word$;
datalines;
wow
great
ok
good
better
best
;

data negative;
input word$;
datalines;
bad
meh
boring
never
;

data have;
length comment $100;
input comment &;
datalines;
Wow so great!
It's OK
Good but boring...
Meh
Good Good Good Better Best, Never let it rest.
;

data emotion;
length type hlo $1 label $4 fmtname $7;
set positive(in=pos) negative(in=neg);
fmtname = "Emotion";
type = "I"; hlo = "U";
start = upcase(word);
label = put(pos - neg, best4.);
run;

proc format cntlin=emotion; run;

data want(drop=pos len i);
set have;
emoScore = 0;
call scan(comment, 1, pos, len);
do i = 2 by 1 while (pos>0);
     emoScore = sum(emoScore, input(substr(comment, pos, len), ?? emotion.));
     call scan(comment, i, pos, len);
     end;
run;

proc print noobs; run;

PG

TomKari · Posted 06-02-2013 07:27 PM

That would be great fun. I always like to make these somewhat real world. @sucksatsas would you be in a position to share your lists of "good" and "bad" words, the number of comments that you want to use this on, and how many words are typically in your comments?

I can generate some data that'll pound all these solutions, and I'll post the results. Hopefully over the next week or two.

Tom

Haikuo · Posted 06-02-2013 08:55 PM

Thanks, Tom, looking forward to your verdict. After spending more time with it, I have made the Hash code more succinct (not sure about the efficient improvement though). At the same time, since Tom is doing the service for us, I 'd like to add Array() approach to the pool of candidates .

data good;

input good$;

cards;

wow

great

ok

good

better

best

;

data bad;

input bad$;

cards;

bad

meh

boring

never

;

data have;

input comment $50.;

cards;

"Wow so great"

"It's OK"

"Good but boring"

"Meh"

"Good Good Good Better Best, Never let it rest"

;

/*Hash()*/

data want_hash;

if _n_=1 then do;

if 0 then set good bad;

declare hash g(dataset:'good', multidata:'y');

g.definekey('good');

g.definedone();

declare hash b(dataset:'bad', multidata:'y');

b.definekey('bad');

b.definedone();

end;

set have;

length s $ 50;

do i=1 by 1 until (missing(s));

s=lowcase(scan(comment,i,'", '));

gs+ (not g.find(key:s));

bs+ (not b.find(key:s));

end;

Rating=sum(gs,-bs);

call missing(gs, bs);

keep comment rating;

run;

/* Array()*/

proc sql noprint;

select nobs into :bad_obs from dictionary.tables where LIBNAME='WORK' AND MEMNAME='BAD';

SELECT NOBS INto :GOOD_OBS FROM DICTIONARY.TABLES WHERE LIBNAME='WORK' AND MEMNAME='GOOD';

QUIT;

data want_array;

if _n_=1 then do;

do i=1 to bobs;

set bad nobs=bobs point=i;

array b(&bad_obs) $10. _temporary_;

b(i)=bad;

end;

do i=1 to gobs;

set good nobs=gobs point=i;

array g(&good_obs) $10. _temporary_;

g(i)=good;

end;

set have;

length s $ 50;

do i=1 by 1 until (missing(s));

s=lowcase(scan(comment,i,'", '));

gs+ (s in g);

bs+ (s in b);

end;

Rating=sum(gs,-bs);

call missing(gs, bs);

keep comment rating;

run;

Haikuo

PGStats · Posted 06-02-2013 10:49 PM

The more the merrier. A proposal with hash and prx :

data positive;
length word $12;
input word;
datalines;
wow
great
ok
good
better
best
;

data negative;
length word $12;
input word;
datalines;
bad
meh
boring
never
;

data have;
length comment $100;
input comment &;
datalines;
Wow so great!
It's OK
Good but boring...
Meh
Good Good Good Better Best, Never let it rest.
;

/* Note: datasets positive and negative could also include a word specific value or

there could be a single table with positive and negative values */

data emotions;
set positive(in=pos) negative(in=neg);
word = upcase(word);
value = pos - neg;
run;

data want;
length word $12;
if prxId = 0 then do;
     declare hash w(dataset:"emotions");
     w.definekey("word");
     w.definedata("value");
     w.definedone();
     call missing(word, value);
     prxId + prxparse("/\w+/");
     end;

set have;
start = 1;
score = 0;
call prxnext(prxId, start, -1, comment, pos, len);
do while (pos > 0);
     word = upcase(substr(comment, pos, len));
     if 0 = w.find() then score + value;
     call prxnext(prxId, start, -1, comment, pos, len);
     end;
drop prxId word value start pos len;
run;

proc print noobs; run;

PG

TomKari · Posted 07-16-2013 10:08 AM

Hello again, after a long absence.

Sorry, everybody, as often happens life interfered with my plans for this. But I think I'm back on track. I have a test environment set up, and have run some preliminary tests. Although SAS is too fast to make the results meaningul, preliminary times are (in order of the programs that were submitted):

Hai.kuo #1

      real time           4.52 seconds
      user cpu time       2.69 seconds
      system cpu time     0.09 seconds

Tom Kari

      real time          83.41 seconds
      user cpu time       6.44 seconds
      system cpu time     7.27 seconds

PGStats #1

      real time           2.06 seconds
      user cpu time       2.01 seconds
      system cpu time     0.04 seconds

Hai.kuo #2

      real time           2.54 seconds
      user cpu time       2.44 seconds
      system cpu time     0.06 seconds

Hai.kuo #3

      real time           6.48 seconds
      user cpu time       6.25 seconds
      system cpu time     0.03 seconds

PGStats #2

      real time           3.06 seconds
      user cpu time       2.93 seconds
      system cpu time     0.07 seconds

I'll be ramping up the data volumes to get some meaningful times, and will report back on both times and memory usage (which should be significant for the array and hash approaches).

Talk to you soon!
Tom

Linlin · Posted 07-16-2013 11:19 AM

PG is the winnor! Congratulations!!!

TomKari · Posted 07-17-2013 12:50 PM

Not so fast!

1. I just finished some minor tweaking to the programs to correct some mistakes I'd made in getting them to all run off of the same data, but I don't think there will be any serious changes, but;

2. We still need to see how they perform at a volume that takes a minute or two.

So far, the only thing we're sure of is that my code is the slowest!:smileyblush:

Tom

Linlin · Posted 07-17-2013 02:25 PM

Tom,

we change the rule to:

The slower the better.

Ksharp · Posted 07-17-2013 09:23 PM

it is very interesting question, Are you doing Context Analysis ?

data good;
  input good $upcase8. ;
  cards;
Wow
Great
OK
Good
  ;
  data bad;
  input bad $upcase8. ;
  cards;
Bad
Meh
boring
  ;
data have;
  input comment $50.;
  cards;
  "Wow so great"     
"It's OK"            
"Good but boring"    
"Meh"
"Good Good Good Better Best, Never let it rest"
;
run;
data want(keep=comment score);
  if _n_=1 then do;
    if 0 then set good ;
    if 0 then set bad;
    declare hash g(dataset:'good');
    g.definekey('good');
    g.definedone();
    declare hash b(dataset:'bad');
      b.definekey('bad');
      b.definedone();
  end;
  set have;
  score=0;
  do i=1 to countw(comment, ,'ka');
   temp=upcase(scan(comment,i, ,'ka')); put temp= ;
   good=temp; if g.find()=0 then score+1;
   bad=temp; if b.find()=0 then score+(-1);
 end;
run;

Ksharp

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away