Venkat4
Quartz | Level 8

I have a text field with several sentences. I am trying to do the following:

1) separate all words -> get frequency counts on each word

2) separate every 2 consecutive words -> get frequency counts on each bigram

3) separate every 3 consecutive words -> get frequency counts on each trigram

and so on...

 

I think I have only partially succeeded with step 1. I am looking for base SAS code that can do the bigram, trigram, etc. counts.

Also, I want to use SOUNDEX, SPEDIS, etc. to group words together even when there is a misspelling.

 

Can anyone give me some pointers (I am not looking for the entire solution) on how to solve this using SAS code?

 

Thank you.

5 REPLIES
ballardw
Super User

It might help to provide some example data and what you expect the result for that example to be.

 

For one thing, a definition of consecutive words for your purpose is needed. Are "words" separated by a comma consecutive? By a period? By some character like @ # $ %?

When counting, should case be considered? Would "This street" and "this street" be in the same count?

 

Another very important bit might be the "and so on". Just how long are your phrases, in terms of your "word" definition?

 

Use of SOUNDEX may well be questionable, as you might find cases of a long multi-syllable word matching the SOUNDEX result of several short words.

Venkat4
Quartz | Level 8

Thank you!

 

I can convert everything to uppercase or lowercase. Also, I have a stop-word list, and I want to remove all stop words first.

Here is an example of what I am looking for, using the example text below. Words are separated by spaces, and the stop word "is" is removed before counting.

 

"Lorem Ipsum text is simply dummy text. "

 

Unigram - Word and counts -

Lorem - 1

Ipsum - 1

text - 2

simply - 1

dummy - 1

 

Bigram - two words and counts -

Lorem Ipsum - 1

Ipsum text - 1

text simply - 1

simply dummy - 1

dummy text - 1

 

Trigrams will be derived the same way.

 

Another goal is to find misspellings of words and group them together. The resulting table will have 3 columns (group_word, different_variations, count), so that when I search for the correct word or phrase in newer data, I can use that group to include all variations instead of only the correct spelling.
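
One possible way to build such a grouping table in base SAS is sketched below. It is only a starting point under stated assumptions: the datasets `correct` and `observed`, their contents, and the distance threshold of 20 are all made up for illustration. It compares each observed word with each correctly spelled word using SPEDIS, and also records whether the two share a SOUNDEX code.

```sas
/* Hypothetical list of correctly spelled words */
data correct;
   length group_word $40;
   input group_word $;
   datalines;
street
dummy
;

/* Hypothetical words found in the text, with their counts */
data observed;
   length different_variations $40;
   input different_variations $ count;
   datalines;
street 5
stret 2
streat 1
dummy 4
dumy 1
;

proc sql;
   create table groups as
   select c.group_word,
          o.different_variations,
          o.count,
          /* SPEDIS(query, keyword): smaller values mean closer spellings */
          spedis(o.different_variations, c.group_word) as dist,
          /* 1 when both words share a SOUNDEX code, otherwise 0 */
          (soundex(o.different_variations) = soundex(c.group_word)) as same_soundex
   from correct as c, observed as o
   where calculated dist <= 20      /* threshold is an assumption; tune it */
   order by c.group_word, dist;
quit;
```

The cross join compares every pair, so for a large vocabulary it may be worth restricting the comparison first, for example to pairs with the same first letter.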

Reeza
Super User

This shows how to separate the words:

https://github.com/statgeek/SAS-Tutorials/blob/master/text_analysis.sas

 

For bigram/trigram I suggest using an array.

Here's a tutorial on using Arrays in SAS
https://stats.idre.ucla.edu/sas/seminars/sas-arrays/
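
As a rough illustration of the array approach for n-grams, here is a minimal base SAS sketch rather than a tested solution for your data. The dataset name `have`, the variable `text`, the cap of 50 words per row, the delimiter list, and the tiny stop-word list are all assumptions made for the example.

```sas
/* Hypothetical input: one row per record, free text in TEXT */
data have;
   length text $200;
   text = "Lorem Ipsum text is simply dummy text.";
run;

data ngrams(keep=n ngram);
   set have;
   length word $40 ngram $200;
   array w[50] $40 _temporary_;          /* assumed cap of 50 words per row */

   /* Step 1: split into lowercase words, skipping stop words */
   k = 0;
   do i = 1 to countw(text, ' .,;:!?');
      word = lowcase(scan(text, i, ' .,;:!?'));
      if word in ('is', 'a', 'the') then continue;   /* assumed stop words */
      if k >= dim(w) then leave;                     /* array is full */
      k = k + 1;
      w[k] = word;
   end;

   /* Step 2: write every run of n consecutive words, n = 1 to 3 */
   do n = 1 to 3;
      do i = 1 to k - n + 1;
         ngram = w[i];
         do j = 1 to n - 1;
            ngram = catx(' ', ngram, w[i + j]);
         end;
         output;
      end;
   end;
run;

/* Step 3: frequency counts per n-gram length */
proc freq data=ngrams order=freq;
   tables n * ngram / list nocum nopercent;
run;
```

Because the stop words are dropped before the n-grams are built, words on either side of a removed stop word are treated as consecutive. If that is not what you want, build the n-grams first and filter afterwards.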

 



Venkat4
Quartz | Level 8

Thank you, that was very helpful, Reeza!

 

I have mostly used arrays with numeric variables and never with text fields. The example you gave also used only numbers.

I will look online for examples of arrays with text fields in SAS, but if you have a simple example, I would like to see it so I can expand on it. Thank you again.
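
For reference, a character array is declared much like a numeric one, with a `$` in the ARRAY statement. The sketch below is a minimal, hypothetical example (the dataset `clean` and the variables `word1`-`word3` are invented for illustration) that loops over three text variables, strips punctuation, and lowercases each value.

```sas
data clean;
   length word1-word3 $20;
   array w[3] $ word1-word3;          /* the $ marks a character array */

   word1 = "Lorem,";
   word2 = "IPSUM";
   word3 = "text.";

   do i = 1 to dim(w);
      /* modifier 'p' removes punctuation; then convert to lowercase */
      w[i] = lowcase(compress(w[i], , 'p'));
   end;

   drop i;
run;
```

Setting the lengths with a LENGTH statement before the ARRAY statement avoids the default length of 8 that new character array elements would otherwise get.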
