Solved: Word list search

dio · Posted 02-22-2012 06:47 AM

Hi all,

i am a SAS beginner and my issue is the following: I have a long list of words and associated values (e.g., frequency in spoken language) in one dataset (Word_List) and another even longer dataset consisting of phrases(Phrase_List). I have to check whether each of the words from Word_List appears in each of the phrases in Phrase_List. If found, then the phrase is assigned the value of the word found.

So far, I have only managed to do this for single words in the Phrase_List dataset using INDEXW. However, i would need tips on how to do this automatically from one dataset to the other.

My code for single word search:

data ref.phrase_score;

set ref.phrase_list;

score=0;

found=indexw(phrase,"able");

if found=0 then delete;

else score+3.56;

keep phrase_ID phrase found score;

run;

I would really appreciate tips on how to do this automatically for all words from the list and using these two separate files.

Thanks,

Dan

PGStats · Posted 02-23-2012 10:44 AM

Then it is quite simple to do (if you don't need to count multiple occurences) :

data phrases;
length phrase $200; /* longer, if needed */
input;
phrase = trim(lowcase(_infile_));
datalines;
i feel at my best because i have been working on important goals for the past three days, and i've been able to complete them
i feel a bit disappointed that i wasn't as eager about it as everyone else but it was the best book i've read at least
i just hope we get to fix our relationship as best friends, it just takes a little bit of courage
;

data words;
length word $16;
input word score;
word = lowcase(word);
datalines;
able 5.89
best 6.76
courage 6.30
;

proc sql;
create table scores as
select phrase, sum((indexw(phrase, word)>0)*score) as phraseScore
from phrases, words
group by phrase;

proc print; run;

Good luck.

PG

View solution in original post

Haikuo · Posted 02-22-2012 08:08 AM

It seems to me you could use hash(). Following code base on assumptions: 1) word_list is in table (dataset), instead of macro variable, character variable or raw text. 2)table work_list share the same key variable "phrase" with your targeted table ref.phrase.list 3) 'score' is retained and sum up by 3.56 each time there is a match.

data ref.phrase_score;

if 0 then set word_list;

declare hash _m(dataset: "word_list");

_m.definekey ('phrase');

_m.definedata('phrase');

_m.definedone();

do until (last);

set ref.phrase_list end=last;

if _m.find()=0 then do;

score+3.56;

output;

end;

stop;

run;

You maybe able to do it using SQL as well, but I haven't figure out 'score' part of it.

Regards,

Haikuo

art297 · Posted 02-22-2012 08:16 AM

Dan,

A perfect chore for a no longer documented, but still existing proc: proc spell.

Take a look at: http://www.sascommunity.org/wiki/Proc_spell

If you need it, I still have a copy of the original documentation. Since it is there, you can easily build the dictionary of words and proc spell can do all of the heavy work. Then, you only have to merge the resulting file with your full word list that contains the values.

PGStats · Posted 02-22-2012 10:16 AM

What do you want as a result (score) when a word appears more than once in the phrase? - PG

PG

Ksharp · Posted 02-22-2012 09:50 PM

What do these two dataset look like ? and output you want ?

You'd better give an example which can make your question clearer.

I think SQL or Hash can get it.

Ksharp

dio · Posted 02-23-2012 07:12 AM

Word list search

Hi,

thanks a lot for the tips! I have tried Proc Spell so far, which seems to work quite nicely, only that it delivers the frequencies of words from the lexicon that do NOT appear in the phrase list, whereas i need the words that do appear.

Example:

Word file= {"able"; "best";"courage"}. Every word in the list has an associated emotion rating (i am a psychologist) on a scale from 1-7: (able; 5.89); (best; 6.76); (courage; 6.30).

Phrase file = {"i feel at my best because i have been working on important goals for the past three days, and i've been able to complete them";

"i feel a bit disappointed that i wasn't as eager about it as everyone else but it was the best book i've read at least";

"i just hope we get to fix our relationship as best friends, it just takes a little bit of courage"}

I want to check if each of the words is contained in each phrase, and then assign the word's score to the phrase if found. E.g., best and courage are both present in phrase 3, so the phrase_score for phrase 3 will be equal to 6.76 + 6.30 = 13.06.

Of course, it would be nice that if the word is found more than once in the same phrase, the phrase_score is incremented as well, but at moment i am trying to get this to work for one occurence per phrase.

Hash looks very interesting, tried it, but i seem to be doing something wrong there, because it delivers an output file with 0 observations.

What I have done so far as a test:

/* Count number of observations in ref.word_list and ref.phrase_list*/

PROC SQL;

SELECT COUNT(*), max(length(word)) INTO: WordNo, :Maxl FROM ref.word_list;

select count(*), max(length(phrase)) INTO: PhraseNo, :Maxlg FROM ref.phrase_list;

QUIT;

/* Load only 15 words from the list of words into a one-dimensional array, for testing purposes*/

data ref.solution;

set ref.phrase_list end=eoflist;

length word1-word&WordNo $&Maxl;

array wordz {&WordNo} word1-word&WordNo;

retain word1-word&WordNo;

if _n_=1 then do iWord = 1 to 15;

set ref.word_list;

wordz{iWord} = word;

end;

/* Look for each word in each phrase*/

score=0;

do iWord = 1 to 15;

lag(score);

MatchW = indexw(phrase,wordz{iWord});

if MatchW > 0 then score+3;

end;

keep phrase score;

run;

Dan

PGStats · Posted 02-23-2012 10:44 AM

Then it is quite simple to do (if you don't need to count multiple occurences) :

data phrases;
length phrase $200; /* longer, if needed */
input;
phrase = trim(lowcase(_infile_));
datalines;
i feel at my best because i have been working on important goals for the past three days, and i've been able to complete them
i feel a bit disappointed that i wasn't as eager about it as everyone else but it was the best book i've read at least
i just hope we get to fix our relationship as best friends, it just takes a little bit of courage
;

data words;
length word $16;
input word score;
word = lowcase(word);
datalines;
able 5.89
best 6.76
courage 6.30
;

proc sql;
create table scores as
select phrase, sum((indexw(phrase, word)>0)*score) as phraseScore
from phrases, words
group by phrase;

proc print; run;

Good luck.

PG

dio · Posted 02-23-2012 12:42 PM

Works like a charm, thanks a lot!

Zachary · Posted 06-01-2015 05:41 PM

I apologize if this is a very old thread. I really like the sql code to accomplish the word searches here. But now I would like to do a search on a phrase instead of just words.

But, as you can see I am even having problems with that as my second phrase is not reading in the data correctly.

Please let me know of any suggestions. Thank you very much in advance.

data phrases;
length phrase $200; /* longer, if needed */
input;
phrase = trim(lowcase(_infile_));
datalines;
phrase one able should be 0
phrase two able stuff should be 3
phrase three best should be 4
phrase four courage should be 5
phrase five able courage should be 5
phrase six able stuff courage should be 8
;

data words;
length word $16;
input word score;
word = lowcase(word);
datalines;
able stuff 3
best 4
courage 5
;

proc sql;
create table scores as
select phrase, sum((indexw(phrase, word)>0)*score) as phraseScore
from phrases, words
group by phrase;

proc print; run;

PGStats · Posted 06-01-2015 09:59 PM

Use the & input modifier to read in many words and make sure you separate the score from the words by at least two spaces :

data words;

length word $16;

input word & score;

word = lowcase(word);

datalines;

able stuff 3

best 4

courage 5

;

The rest stays as is.

PG

Zachary · Posted 06-02-2015 11:05 AM

Thank you very much. It worked, and it worked well.

Now let me see if I can throw a curve ball into the mix. The code you provided does an excellent job of summating the scores associated with each of the words. Now, let us think about saying that the scores no longer matter. Now when it ultimately creates a table I would like to create columns for each word or phrase that it finds:

Phrase	Flag_able_stuff	Flag_best	Flag_courage
phrase one able
phrase two able stuff	1
phrase three best		1
phrase four courage			1
phrase five able courage			1
phrase six able stuff courage	1		1

The newly created column for Flag_able_stuff had to put a _ in between the words to work.

Can a variation of your excellent SQL code do this? Or any other suggestions?

PGStats · Posted 06-02-2015 01:53 PM

Will require a proc transpose step :

proc sql;

create table scores as

select phrase, word, indexw(phrase, word) > 0 as wordPresent

from phrases, words

order by phrase, word;

quit;

Proc transpose data=scores out=scoresT(drop=_:) prefix=flag_;

by phrase;

var wordPresent;

id word;

run;

proc print data=scoresT noobs; run;

Note: I do not recommend the wide data structure of scoresT. A long version :

proc sql;

create table scoresL as

select phrase, word

from

phrases inner join

words on indexw(phrase, word) > 0

order by phrase, word;

quit;

proc print data=scoresL noobs; run;

would be far more useful for most purposes.

PG

Zachary · Posted 06-03-2015 12:21 PM

Wow thanks. I think I am picking up that the Cartesian product is a beast in here.

Question - everything seems to be working, but I get many errors like the following:

The ID value "'flag_day qrc meet surgery ime ce'n" occurs twice in the same BY group

I am assuming that is because my character lengths are too long. But it is sort of necessary for what I am doing.

Will there be a limit as to the number of characters I may use, or may I extend the default in SQL?

Thanks.

PGStats · Posted 06-03-2015 01:33 PM

The restriction is with the number of characters that SAS allows for variable names, which cannot be longer than 32 characters. There is no way around that. Why would anyone want variable names longer than 32 chars? You could choose your own variable names by adding them to the WORDS dataset :

data words;

length word $64 vname $27; /* 27 = 32 - length("flag_") */

input word & score vname;

word = lowcase(word);

datalines;

able stuff 3 able_stuff

best 4 best

courage 5 courage

;

proc sql;

create table scores as

select phrase, word, vname, indexw(phrase, word) > 0 as wordPresent

from phrases, words

order by phrase, vname;

quit;

proc sql;

create table scoresL as

select phrase, vname

from

phrases inner join

words on indexw(phrase, word) > 0

order by phrase, vname;

quit;

Proc transpose data=scores out=scoresT(drop=_:) prefix=flag_;

by phrase;

var wordPresent;

id vname;

run;

proc print noobs; run;

You will get a proper dataset but will loose the exact expressions related to the columns. This is just one more reason to prefer the long over the wide dataset format.

PG

Zachary · Posted 06-03-2015 01:56 PM

Thank you again PGStats. Once again I have been humbled by what I am trying to accomplish. I have one more idea brewing here.

But, a little background.

I have over 1,000,000 million comments used by our insurance company. The goal with part of our predictive modeling is to see if any key words or phrases are predictive of future reserve changes and other indices.

We began the process by doing a Text Analysis within Enerprise Miner. From what I have seen this is a very useful tool in itself by creating clusters and/or factors that academically would be very useful.

I am trying to take the words and phrases that come out as statisticially significant- and provide further analyses to see if they are statistically meaningful. My goal was to collect hundreds - even thousands of these key words and phrases - then let the bivariate indices such as correlations be our "stepping stone" for figuring if they would be worthy of inclusion in a decision tree model. Ideally I would like to find anywhere from a few dozen to over one hundred words & phrases that could potentially predict reserve changes. Of course these would not be the main predictors.

Any suggestions are welcome. Right now I have about a half a dozen different ideas brewing on how I wish to proceed. But no worries about the 1 million records. I have already pulled a random sample of 5,000 for working purposes.

Word list search

Word list search

Word list search

Word list search

Word list search

Word list search

Word list search

Word list search

Word list search

Re: Word list search

Re: Word list search

Re: Word list search

Re: Word list search

Re: Word list search

Re: Word list search

Re: Word list search

Registration is open

SAS Training: Just a Click Away