Help using Base SAS procedures

Frequency of Strings

New Contributor rrr
Posts: 3

Is there a way to count the frequency of strings within a character variable? I want to display a count of more commonly used words within a long character variable. Thanks!

Accepted Solutions
Solution
‎03-25-2016 12:04 PM
SAS Super FREQ
Posts: 8,743

Re: Frequency of Strings

Hi..
Leftover from my days as a Lit major, I wrote this program to do a frequency count of the words in the first chapter of Melville's "Moby Dick".

Interestingly enough, after you eliminate all the articles and prepositions and pronouns, the most frequently used word in the first chapter of Moby Dick is 'sea' (13 times) followed by 'water' (8 times). The words 'ship', 'soul', 'man' and 'whale' each occur 3 times. Anyway, the relevant part of that program is shown below -- I had to get rid of a stray '?' in the chapter, which is why the compress is in the code. Also, I turned everything to lower case, so 'The' and 'the' would get counted the same when I did a frequency on the WORD variable.

cynthia

** now break apart each line into separate lowercase words;
** but keep the word order (wordord) and the original capitalization (origword);
data cnt_chp1(keep=chapter pgno paracnt linenum wordord origword word);
    set moby_ch1;
    i = 1;
    origword = scan(record,i);
    word = compress(lowcase(origword),'?');
    wordord = i;
    do until (origword = ' ');
        output;
        i + 1;
        wordord = i;
        origword = scan(record,i);
        word = compress(lowcase(origword),'?');
    end;
run;
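
To finish the count, a PROC FREQ step on the WORD variable is one way to go. The following is only a sketch: it assumes the CNT_CHP1 data set created above, and both the stop-word list in the WHERE statement and the output data set name WORD_COUNTS are illustrative choices, not part of the original program.

```sas
* count each distinct word and list the most frequent first;
* the stop-word list and the word_counts name are illustrative assumptions;
proc freq data=cnt_chp1 order=freq;
    where word not in (' ', 'the', 'a', 'an', 'and', 'of', 'to', 'in', 'i');
    tables word / nocum nopercent out=word_counts;
run;
```

ORDER=FREQ sorts the table by descending count, and OUT= saves the counts to a data set in case you want to filter or report on them later.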


All Replies
SAS Super FREQ
Posts: 8,743

Re: Frequency of Strings

Hi:
Do you mean that you have a list of words (a, an, the, and) that you're looking for -- or do you want to take a text string and find the most common words in it?

This may be a job for Text Miner:
http://support.sas.com/documentation/onlinedoc/txtminer/getstarted31.pdf

but in a Base SAS world, there's always writing out your "words" and then doing PROC FREQ on them.

cynthia
New Contributor rrr
Posts: 3

Re: Frequency of Strings

> Hi:
> do you mean that you have a list of words (a, an,
> the, and) that you're looking for -- or you want to
> take a text string and find out the most common
> words in a string???
>
> but in a Base SAS world, there's always writing out
> your "words" and then doing PROC FREQ on them.
>
> cynthia

I would like to take the text string and find the most common words in it. Is there a way to do that without using Text Miner? If I have to, I could guess the common words in advance and use the countw function. Thanks for your help!
Super User
Posts: 5,257

Re: Frequency of Strings

If you want to count words for all rows together I think you should take out each word of the string, output them, and then do a PROC FREQ as Cynthia suggested. To do that you'll probably use some kind of do until logic together with the scan function and the output statement.
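
As a rough, self-contained sketch of that idea: the data set HAVE, the variable TEXT, the sample lines and the delimiter list below are all made up for illustration. This version uses an iterative DO loop bounded by COUNTW instead of DO UNTIL, so a blank string produces no output rows.

```sas
* hypothetical input with one long character variable named text;
data have;
    length text $80;
    input text $80.;
    datalines;
Call me Ishmael. Some years ago, never mind how long
The sea, the sea, and the water
;

* write one row per word, lowercased so that The and the are counted together;
data words(keep=word);
    set have;
    length word $40;
    do i = 1 to countw(text, ' .,;:!?');
        word = lowcase(scan(text, i, ' .,;:!?'));
        output;
    end;
run;

* frequency of each word across all rows, most common first;
proc freq data=words order=freq;
    tables word / nocum nopercent;
run;
```

Passing the same delimiter list to both COUNTW and SCAN keeps the word count and the extracted words in step with each other.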

Linus
Data never sleeps
Contributor
Posts: 42

Re: Frequency of Strings

Be honest now - when you read the book originally, did you skip the 'Whaling Chapters'?
SAS Super FREQ
Posts: 8,743

Re: Frequency of Strings

Hi:
Not the first time or the second time. But by the third time I read it, yes, I did skip the whaling chapters.
cynthia
Super User
Posts: 5,257

Re: Frequency of Strings

If you already know what words to look for, the countw function may be of interest.
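
One distinction worth checking in the documentation: COUNTW returns the number of words in a string, while COUNT returns how many times a given substring occurs, so COUNT may be the closer fit when you are looking for a specific word. A small sketch with a made-up string:

```sas
data _null_;
    text = 'The sea, the sea, and the water';
    n_words = countw(text, ' ,');      * number of words in the string;
    n_sea   = count(text, 'sea', 'i'); * occurrences of sea, ignoring case;
    put n_words= n_sea=;               * writes n_words=7 n_sea=2 to the log;
run;
```

Note that COUNT matches substrings, so 'sea' would also be counted inside a word such as 'season'.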

Regards,
Linus
Data never sleeps
☑ This topic is SOLVED.
