rrr
Calcite | Level 5
Is there a way to count the frequency of strings within a character variable? I want to display a count of more commonly used words within a long character variable. Thanks!
1 ACCEPTED SOLUTION

Cynthia_sas
SAS Super FREQ

Hi..
Leftover from my days as a Lit major, I wrote this program to do a frequency count of the words in the first chapter of Melville's "Moby Dick".

Interestingly enough, after you eliminate all the articles and prepositions and pronouns, the most frequently used word in the first chapter of Moby Dick is 'sea' (13 times) followed by 'water' (8 times). The words 'ship', 'soul', 'man' and 'whale' each occur 3 times. Anyway, the relevant part of that program is shown below -- I had to get rid of a stray '?' in the chapter, which is why the compress is in the code. Also, I turned everything to lower case, so 'The' and 'the' would get counted the same when I did a frequency on the WORD variable.

cynthia

** now break apart each line into separate lowercase words;
** but keep the word order (wordord) and the original capitalization (origword);
data cnt_chp1(keep=chapter pgno paracnt linenum wordord origword word);
    set moby_ch1;
    ** start with the first word on the line;
    wordord = 1;
    origword = scan(record, wordord);
    ** do while tests before the first output, so a blank line writes no rows;
    do while (origword ne ' ');
        ** lowercase the word and strip the stray question mark;
        word = compress(lowcase(origword), '?');
        output;
        wordord + 1;
        origword = scan(record, wordord);
    end;
run;
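
For the counting step itself, a minimal sketch might look like the following. It assumes the CNT_CHP1 data set and WORD variable from the program above; the stop-word list and the WORD_COUNTS output data set name are hypothetical and only illustrate the idea of dropping articles, prepositions and pronouns before counting.

```sas
/* Hypothetical stop-word list; extend it to suit the text being counted */
proc freq data=cnt_chp1 order=freq;
    where word not in ('a', 'an', 'the', 'and', 'of', 'in', 'to', 'i', 'it');
    /* order=freq lists the most frequent words first; */
    /* out= also saves the counts to a data set (name is an assumption) */
    tables word / nocum out=word_counts;
run;
```

PROC FREQ leaves missing values out of the table by default, so any blank WORD rows do not show up in the counts.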


7 REPLIES
Cynthia_sas
SAS Super FREQ
Hi:
Do you mean that you have a list of words (a, an, the, and) that you're looking for, or do you want to take a text string and find the most common words in it?

This may be a job for Text Miner:
http://support.sas.com/documentation/onlinedoc/txtminer/getstarted31.pdf

but in a Base SAS world, there's always writing out your "words" and then doing PROC FREQ on them.

cynthia
rrr
Calcite | Level 5
> Hi:
> do you mean that you have a list of words (a, an,
> the, and) that you're looking for -- or your want to
> take a text string and find out the most common
> words in a string???
> >
> but in a Base SAS world, there's always writing out
> your "words" and then doing PROC FREQ on them.
>
> cynthia

I would like to take the text string and find the most common words in the string. Is there a way to do that without using the Text Miner? If I have to, I could estimate the common words and use the countw function. Thanks for your help!
LinusH
Tourmaline | Level 20
If you want to count words for all rows together I think you should take out each word of the string, output them, and then do a PROC FREQ as Cynthia suggested. To do that you'll probably use some kind of do until logic together with the scan function and the output statement.

Linus
Data never sleeps
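
The approach described above (SCAN in a loop, OUTPUT for each word, then PROC FREQ) could be sketched end to end as shown here. The HAVE data set, the COMMENT variable and the sample text are hypothetical stand-ins for the real table and its long character variable.

```sas
/* Hypothetical sample data: one long character variable per row */
data have;
    infile datalines truncover;
    input comment $char200.;
    datalines;
The sea was calm and the ship was ready
A ship sails on the open sea
;
run;

/* Write one row per word, lowercased so 'The' and 'the' count together */
data words(keep=word);
    set have;
    length word $40;
    i = 1;
    word = lowcase(scan(comment, i));
    /* scan returns a blank once the words on this row run out */
    do while (word ne ' ');
        output;
        i + 1;
        word = lowcase(scan(comment, i));
    end;
run;

/* Count the words across all rows, most frequent first */
proc freq data=words order=freq;
    tables word / nocum;
run;
```

Because every word from every row lands in one data set, the frequency table covers all rows together rather than one row at a time.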
mftuchman
Quartz | Level 8
Be honest now - when you read the book originally, did you skip the 'Whaling Chapters'?
Cynthia_sas
SAS Super FREQ
Hi:
Not the first time or the second time. But by the third time I read it, yes, I did skip the whaling chapters.
cynthia
LinusH
Tourmaline | Level 20
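If you already know which words to look for, the COUNT function may be of interest. (COUNTW returns the total number of words in a string, not the number of times a particular word occurs.)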

Regards,
Linus
Data never sleeps
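
When the words are known in advance, a small sketch with the COUNT function might look like this. COUNT tallies occurrences of a substring, whereas COUNTW returns the total number of words in a string. The KNOWN_COUNTS data set, the variable names and the sample text are hypothetical.

```sas
data known_counts;
    length text $200;
    text = 'The sea, the sea, and the open sea';
    /* count(string, substring, modifiers): the 'i' modifier ignores case */
    n_sea = count(text, 'sea', 'i');
    n_the = count(text, 'the', 'i');
    /* write both counts to the log */
    put n_sea= n_the=;
run;
```

COUNT matches substrings, so 'the' would also be found inside 'other'. If only whole words should count, splitting the text into words first and running PROC FREQ on them avoids that problem.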
