Solved: Re: Frequency of Strings

rrr · Posted 08-22-2008 02:59 PM

Is there a way to count the frequency of strings within a character variable? I want to display a count of the most commonly used words within a long character variable. Thanks!

Cynthia_sas · Posted 08-26-2008 10:58 AM

Hi..
Leftover from my days as a Lit major, I wrote this program to do a frequency count of the words in the first chapter of Melville's "Moby Dick".

Interestingly enough, after you eliminate all the articles and prepositions and pronouns, the most frequently used word in the first chapter of Moby Dick is 'sea' (13 times) followed by 'water' (8 times). The words 'ship', 'soul', 'man' and 'whale' each occur 3 times. Anyway, the relevant part of that program is shown below -- I had to get rid of a stray '?' in the chapter, which is why the compress is in the code. Also, I turned everything to lower case, so 'The' and 'the' would get counted the same when I did a frequency on the WORD variable.

cynthia

** now break apart each line into separate lowercase words;
** but keep the word order (wordord) and the original capitalization (origword);
data cnt_chp1(keep=chapter pgno paracnt linenum wordord origword word);
    set moby_ch1;
    i = 1;
    origword = scan(record,i);
    word = compress(lowcase(origword),'?');
    wordord = i;
    ** do while tests before the first pass, so a blank record writes no row;
    do while (origword ne ' ');
        output;
        i + 1;
        wordord = i;
        origword = scan(record,i);
        word = compress(lowcase(origword),'?');
    end;
run;
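
The frequency step on the WORD variable is not shown in the post. A minimal sketch of what it could look like, assuming the cnt_chp1 data set created above; the stop-word list is purely illustrative and is not the list actually used for the Moby Dick counts:

```sas
** count each remaining word, most frequent first;
** the stop-word list below is an illustrative assumption;
proc freq data=cnt_chp1 order=freq;
    where word not in ('a', 'an', 'the', 'and', 'of', 'in', 'to', 'i', 'it');
    tables word / nocum nopercent;
run;
```

ORDER=FREQ sorts the table by descending count, so the most common words appear at the top. The WHERE statement drops the stop words before counting.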

Cynthia_sas · Posted 08-22-2008 06:19 PM

Hi:
Do you mean that you have a list of words (a, an, the, and) that you're looking for -- or do you want to take a text string and find out the most common words in that string?

This may be a job for Text Miner:
http://support.sas.com/documentation/onlinedoc/txtminer/getstarted31.pdf

But in a Base SAS world, you can always write each word out as its own observation and then run PROC FREQ on those "words".

cynthia

rrr · Posted 08-26-2008 08:42 AM

> Hi:
> do you mean that you have a list of words (a, an,
> the, and) that you're looking for -- or you want to
> take a text string and find out the most common
> words in a string???
>
> but in a Base SAS world, there's always writing out
> your "words" and then doing PROC FREQ on them.
>
> cynthia

I would like to take the text string and find the most common words in the string. Is there a way to do that without using the Text Miner? If I have to, I could estimate the common words and use the countw function. Thanks for your help!

LinusH · Posted 08-26-2008 09:22 AM

If you want to count words for all rows together I think you should take out each word of the string, output them, and then do a PROC FREQ as Cynthia suggested. To do that you'll probably use some kind of do until logic together with the scan function and the output statement.
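
A self-contained sketch of that approach, for anyone who wants to try it on toy data. The data set names (have, words), the variable name text, and the sample sentences are all hypothetical. It uses a counted loop driven by COUNTW rather than do until, which assumes a SAS release that has COUNTW:

```sas
** hypothetical sample data with one long character variable per row;
data have;
    length text $200;
    infile datalines truncover;
    input text $200.;
    datalines;
The sea was calm and the ship sailed on the sea
A man and a whale met at sea
;
run;

** write one row per word, lowercased so The and the are counted together;
data words(keep=word);
    set have;
    length word $40;
    do i = 1 to countw(text, ' ');
        word = lowcase(scan(text, i, ' '));
        output;
    end;
run;

** frequency of each word across all rows, most common first;
proc freq data=words order=freq;
    tables word / nocum nopercent;
run;
```

Because COUNTW returns 0 for a blank string, rows with no text write nothing to the words data set.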

Linus

Data never sleeps

mftuchman · Posted 09-02-2008 10:01 AM

Be honest now - when you read the book originally, did you skip the 'Whaling Chapters'?

Cynthia_sas · Posted 09-02-2008 10:13 AM

Hi:
Not the first time or the second time. But by the third time I read it, yes, I did skip the whaling chapters.
cynthia

LinusH · Posted 08-25-2008 05:26 AM

If you already know what words to look for, the countw function may be of interest.
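
One caveat worth checking against the documentation: as far as I know, COUNTW returns the total number of words in a string, while COUNT returns how many times a given substring occurs. A small sketch of the difference, with a made-up string and hypothetical variable names:

```sas
** hypothetical one-row example with the results written to the log;
data _null_;
    text = 'The sea, the ship and the open sea';
    n_words = countw(text, ' ,');       /* total words, splitting on blank and comma */
    n_sea   = count(text, 'sea', 'i');  /* occurrences of the substring sea, ignoring case */
    put n_words= n_sea=;
run;
```

COUNT matches substrings, so it would also count 'sea' inside a longer word such as 'seashore'. If that matters, splitting the text into words and running PROC FREQ is the safer route.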

Regards,
Linus

Data never sleeps
