DATA Step, Macro, Functions and more

Scanning an Observation for a Word within a Variable

Reply
N/A
Posts: 1

Scanning an Observation for a Word within a Variable

Hello- I am trying to count the number of negative and positive words within each "Comment" observation.


Variables: Comment: A list of a website comments

  Negative: A list of negative associated words, each observation is one negative word

  Positive: A list of positive associated words, each observation is one positive word    

I have gathered the information and combined them into one data set, now I am confused how to scan/index each comment observation when I'm looking to see if it contains any word in an entire list and eventually count them to something like this:

Does anyone have any insight?

Comment         Negative    Positive     Rating

"Wow so great"                                        2

"It's OK"                                                  1

"Good but boring"                                     0

"Meh"                                                     -1

                        

                         Bad

                         Meh

                         boring

                                       

                                        Wow

                                        Great

                                        OK

                                        Good

Respected Advisor
Posts: 3,156

Re: Scanning an Observation for a Word within a Variable

Posted in reply to sucksatsas

Conveniently this problem becomes another showcase of Hash() power:

data good;

  input good$;

  cards;

  wow

  great

  ok

  good

  better

  best

  ;

  data bad;

  input bad$;

  cards;

  bad

  meh

  boring

  never

  ;

data have;

  input comment $50.;

  cards;

  "Wow so great"     

"It's OK"            

"Good but boring"    

"Meh"

"Good Good Good Better Best, Never let it rest"

;

data want;

  if _n_=1 then do;

    if 0 then set good bad;

    declare hash g(dataset:'good', multidata:'y');

    g.definekey('good');

    g.definedone();

    declare hash b(dataset:'bad', multidata:'y');

      b.definekey('bad');

      b.definedone();

  end;

  set have;

  length s $ 50;

    do i=1 by 1 until (missing(s));

s=lowcase(scan(comment,i,'", '));

         do rc_good=g.find(key:s) by 0 while (rc_good=0);

gs+1;

rc_good=g.find_next(key:s);

        end;

         do rc_bad=b.find(key:s) by 0 while (rc_bad=0);

           bs+1;

rc_bad=b.find_next(key:s);

            end;

       end;

       Rating=sum(gs,-bs);

       call missing(gs, bs);

       keep comment rating;

run;

Haikuo

PROC Star
Posts: 1,167

Re: Scanning an Observation for a Word within a Variable

Although I agree completely with Haikuo that the hash object, and his program, are the best solution to this problem, it is also logically amenable to SQL:

data good;
input good$;
cards;
  wow
  great
  ok
  good
  better
  best
  ;

data bad;
input bad$;
cards;
  bad
  meh
  boring
  never
  ;

data have;
input comment $50.;
cards;
  "Wow so great"     
"It's OK"            
"Good but boring"    
"Meh"
"Good Good Good Better Best, Never let it rest"
;

data have_words(keep=comment CandidateWord);
set have;
length CandidateWord $ 50;

do i=1 by 1 until (missing(CandidateWord));
  CandidateWord=lowcase(scan(comment,i,'", '));

  if ~missing(CandidateWord) then
   output;
end;
run;

proc sql;
create table want_details as
  select comment, 1 as rating
   from have_words h inner join good g on(h.CandidateWord = g.good)
    outer union corresponding
     select comment, -1 as rating
      from have_words h inner join bad b on(h.CandidateWord = b.bad)
;
quit;

proc means noprint nonobs nway;
class comment;
var rating;
output out=want(drop=_type_ _freq_) sum=;
run;

Tom

Super Contributor
Posts: 1,636

Re: Scanning an Observation for a Word within a Variable

Hi Tom,

You should change "her" in

"Although I agree completely with Haikuo that the hash object, and her program, are the best solution to this problem, it is also logically amenable to SQL:"

to "his".  - Linlin

PROC Star
Posts: 1,167

Re: Scanning an Observation for a Word within a Variable

OOPS! Thanks, Linlin. It's lost in my excellent but short memory, but somehow I ended up with the impression that our esteemed colleague shares your gender, not mine.

Well, at least I said his program is the best solution...that's gotta get me SOME points!!

Best,

  Tom

Respected Advisor
Posts: 3,156

Re: Scanning an Observation for a Word within a Variable

Thanks, LinLin, for putting  the "Y" chromosome back to my online identity. Tom,  not being a female really wasn't up to me, LOL. And being a fan since 3 years ago in SAS-L, I am honored for being mentioned by your post,

Haikuo

Respected Advisor
Posts: 4,919

Re: Scanning an Observation for a Word within a Variable

Posted in reply to sucksatsas

I'd like to see a performance comparison between the methods, including this one using an informat :

data positive;
input word$;
datalines;
wow
great
ok
good
better
best
;

data negative;
input word$;
datalines;
bad
meh
boring
never
;
 
data have;
length comment $100;
input comment &;
datalines;
Wow so great!
It's OK
Good but boring...
Meh
Good Good Good Better Best, Never let it rest.
;

data emotion;
length type hlo $1 label $4 fmtname $7;
set positive(in=pos) negative(in=neg);
fmtname = "Emotion";
type = "I"; hlo = "U";
start = upcase(word);
label = put(pos - neg, best4.);
run;

proc format cntlin=emotion; run;

data want(drop=pos len i);
set have;
emoScore = 0;
call scan(comment, 1, pos, len);
do i = 2 by 1 while (pos>0);
     emoScore = sum(emoScore, input(substr(comment, pos, len), ?? emotion.));
     call scan(comment, i, pos, len);
     end;
run;

proc print noobs; run;

PG

PG
PROC Star
Posts: 1,167

Re: Scanning an Observation for a Word within a Variable

That would be great fun. I always like to make these somewhat real world. @sucksatsas would you be in a position to share your lists of "good" and "bad" words, the number of comments that you want to use this on, and how many words are typically in your comments?

I can generate some data that'll pound all these solutions, and I'll post the results. Hopefully over the next week or two.

Tom

Respected Advisor
Posts: 3,156

Re: Scanning an Observation for a Word within a Variable

Thanks, Tom, looking forward to your verdict. After spending more time with it, I have made the Hash code more succinct (not sure about the efficient improvement though). At the same time, since Tom is doing the service for us, I 'd like to add Array() approach to the pool of candidates Smiley Happy.

data good;

  input good$;

  cards;

  wow

  great

  ok

  good

  better

  best

  ;

  data bad;

  input bad$;

  cards;

  bad

  meh

  boring

  never

  ;

data have;

  input comment $50.;

  cards;

  "Wow so great"    

"It's OK"           

"Good but boring"   

"Meh"

"Good Good Good Better Best, Never let it rest"

;

/*Hash()*/

data want_hash;

  if _n_=1 then do;

    if 0 then set good bad;

    declare hash g(dataset:'good', multidata:'y');

    g.definekey('good');

    g.definedone();

    declare hash b(dataset:'bad', multidata:'y');

      b.definekey('bad');

      b.definedone();

  end;

  set have;

  length s $ 50;

    do i=1 by 1 until (missing(s));

s=lowcase(scan(comment,i,'", '));

   gs+ (not g.find(key:s));

   bs+ (not b.find(key:s));

    end;

       Rating=sum(gs,-bs);

       call missing(gs, bs);

       keep comment rating;

run;

/* Array()*/

proc sql noprint;

  select nobs into :bad_obs from dictionary.tables where LIBNAME='WORK' AND MEMNAME='BAD';

  SELECT NOBS INto :GOOD_OBS FROM DICTIONARY.TABLES WHERE LIBNAME='WORK' AND MEMNAME='GOOD';

QUIT;

data want_array;

  if _n_=1 then do;

     do i=1 to bobs;

         set bad nobs=bobs point=i;

         array b(&bad_obs) $10. _temporary_;

       b(i)=bad;

       end;

       do i=1 to gobs;

          set good nobs=gobs point=i;

         array g(&good_obs) $10. _temporary_;

       g(i)=good;

       end;

end;

  set have;

  length s $ 50;

    do i=1 by 1 until (missing(s));

s=lowcase(scan(comment,i,'", '));

   gs+ (s in g);

   bs+ (s in b);

    end;

       Rating=sum(gs,-bs);

       call missing(gs, bs);

       keep comment rating;

run;


Haikuo

        

Respected Advisor
Posts: 4,919

Re: Scanning an Observation for a Word within a Variable

The more the merrier. A proposal with hash and prx :

data positive;
length word $12;
input word;
datalines;
wow
great
ok
good
better
best
;

data negative;
length word $12;
input word;
datalines;
bad
meh
boring
never
;
 
data have;
length comment $100;
input comment &;
datalines;
Wow so great!
It's OK
Good but boring...
Meh
Good Good Good Better Best, Never let it rest.
;

/* Note: datasets positive and negative could also include a word specific value or

there could be a single table with positive and negative values */

data emotions;
set positive(in=pos) negative(in=neg);
word = upcase(word);
value = pos - neg;
run;

data want;
length word $12;
if prxId = 0 then do;
     declare hash w(dataset:"emotions");
     w.definekey("word");
     w.definedata("value");
     w.definedone();
     call missing(word, value);
     prxId + prxparse("/\w+/");
     end;

set have;
start = 1;
score = 0;
call prxnext(prxId, start, -1, comment, pos, len);
do while (pos > 0);
     word = upcase(substr(comment, pos, len));
     if 0 = w.find() then score + value;
     call prxnext(prxId, start, -1, comment, pos, len);
     end;
drop prxId word value start pos len;
run;

proc print noobs; run;

PG

PG
PROC Star
Posts: 1,167

Re: Scanning an Observation for a Word within a Variable

Hello again, after a long absence.

Sorry, everybody, as often happens life interfered with my plans for this. But I think I'm back on track. I have a test environment set up, and have run some preliminary tests. Although SAS is too fast to make the results meaningul, preliminary times are (in order of the programs that were submitted):

Hai.kuo #1

      real time           4.52 seconds
      user cpu time       2.69 seconds
      system cpu time     0.09 seconds

Tom Kari

      real time          83.41 seconds
      user cpu time       6.44 seconds
      system cpu time     7.27 seconds

PGStats #1

      real time           2.06 seconds
      user cpu time       2.01 seconds
      system cpu time     0.04 seconds

Hai.kuo #2

      real time           2.54 seconds
      user cpu time       2.44 seconds
      system cpu time     0.06 seconds

Hai.kuo #3

      real time           6.48 seconds
      user cpu time       6.25 seconds
      system cpu time     0.03 seconds

PGStats #2

      real time           3.06 seconds
      user cpu time       2.93 seconds
      system cpu time     0.07 seconds

I'll be ramping up the data volumes to get some meaningful times, and will report back on both times and memory usage (which should be significant for the array and hash approaches).

Talk to you soon!
  Tom

Super Contributor
Posts: 1,636

Re: Scanning an Observation for a Word within a Variable

PG is the winnor! Congratulations!!!Smiley HappySmiley HappySmiley Happy

PROC Star
Posts: 1,167

Re: Scanning an Observation for a Word within a Variable

Not so fast!

1. I just finished some minor tweaking to the programs to correct some mistakes I'd made in getting them to all run off of the same data, but I don't think there will be any serious changes, but;

2. We still need to see how they perform at a volume that takes a minute or two.

So far, the only thing we're sure of is that my code is the slowest!:smileyblush:

Tom

Super Contributor
Posts: 1,636

Re: Scanning an Observation for a Word within a Variable

Tom,

we change the rule to:

The slower the betterSmiley Wink.

Super User
Posts: 10,018

Re: Scanning an Observation for a Word within a Variable

Posted in reply to sucksatsas

it is very interesting question, Are you doing Context Analysis ?

data good;
  input good $upcase8. ;
  cards;
Wow
Great
OK
Good
  ;
  data bad;
  input bad $upcase8. ;
  cards;
Bad
Meh
boring
  ;
data have;
  input comment $50.;
  cards;
  "Wow so great"     
"It's OK"            
"Good but boring"    
"Meh"
"Good Good Good Better Best, Never let it rest"
;
run;
data want(keep=comment score);
  if _n_=1 then do;
    if 0 then set good ;
    if 0 then set bad;
    declare hash g(dataset:'good');
    g.definekey('good');
    g.definedone();
    declare hash b(dataset:'bad');
      b.definekey('bad');
      b.definedone();
  end;
  set have;
  score=0;
  do i=1 to countw(comment, ,'ka');
   temp=upcase(scan(comment,i, ,'ka')); put temp= ;
   good=temp; if g.find()=0 then score+1;
   bad=temp; if b.find()=0 then score+(-1);
 end;
run;

Ksharp

Ask a Question
Discussion stats
  • 29 replies
  • 936 views
  • 1 like
  • 9 in conversation