dylanleong78
Fluorite | Level 6

Hi,

 

I've been looking into running a spellcheck against a specific column to identify misspelt words that should be corrected.

 

Based on reading several articles and SAS documentation (see below), I've managed to come up with a basic solution that "sort of" works.

https://towardsdatascience.com/handling-text-data-quality-issues-can-yuo-h-andle-thsi-7f77dac3ddff

https://documentation.sas.com/doc/en/pgmsascdc/v_011/casvtapg/n1vl3yuyzcl7x4n0zc4vwynknl99.htm

 

As you can see from spellcheck.sas, this is my thought process:

1. download the en_GB.dic file from https://extensions.openoffice.org/en/project/english-dictionaries-apache-openoffice

2. import the file into a SAS data set

3. create a sample data set containing 6 rows of text with misspelled words in the text column

4. use tpParse to parse the sample data set and produce tpParse_out data set

5. use tpSpell to attempt to identify misspelt words and produce tpSpell_out data set

6. in tpSpell_out, the _OriginalStem_ should contain the original misspelled word whereas the _Parent_ should contain the correct word

 

See the TPSPELL_OUT.csv for the output generated.

 

However, the solution only works for misspelt words (e.g. compzter) whose correctly spelled form (e.g. computer) is repeated multiple times within the sample data set. If a word (e.g. tihs) appears in the sample data set only once, tpSpell cannot determine that "tihs" is actually "this", even though the word exists in the dictionary data set. I have a number of misspelled words in the sample data set (e.g. sentnce, neeeds, storuy, etc.).

 

Other misspelt words can potentially be identified by the _ComplexTag_ value = "inc" but I have no idea what "inc" means.

 

I've been toying with dictPenalty, maxSpellDist, etc. to improve matching, but to no avail.

 

Am I barking up the wrong tree? 

 

1 ACCEPTED SOLUTION

RussAlbright
SAS Employee
Hi dylanleong78,
I think you pretty much have the right understanding of how the action works. The spell correction is statistical in nature: it selects rare terms as candidate misspellings, but only when a similar, frequently occurring term exists that can serve as the correct spelling. It is designed to work on large data, not so much a small set like yours. The dictionary input prevents some rare terms from being treated as potential misspellings if they are in the dictionary, but that does not help you here.
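
To make the idea concrete, here is a minimal sketch of the "rare term mapped to a similar frequent term" logic described above. It is written in Python only to keep it self-contained, and it is not the tpSpell algorithm: the function name, the thresholds (max_rare, min_frequent, min_ratio) and the use of difflib's similarity ratio are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def suggest_corrections(terms, dictionary, max_rare=1, min_frequent=3, min_ratio=0.8):
    """Map rare terms to a similar, frequently occurring term (illustrative only)."""
    counts = Counter(terms)
    # Only terms that occur often enough can act as a "correct spelling".
    frequent = [t for t, c in counts.items() if c >= min_frequent]
    corrections = {}
    for term, count in counts.items():
        # Only rare terms are candidates; dictionary words are never remapped.
        if count > max_rare or term in dictionary:
            continue
        best, best_ratio = None, min_ratio
        for candidate in frequent:
            ratio = SequenceMatcher(None, term, candidate).ratio()
            if ratio >= best_ratio:
                best, best_ratio = candidate, ratio
        if best is not None:
            corrections[term] = best
    return corrections


# Hypothetical toy data: "computer" is frequent, "this" never occurs.
terms = ["computer"] * 4 + ["compzter", "tihs", "story"]
dictionary = {"computer", "this", "story"}
print(suggest_corrections(terms, dictionary))
```

In this toy data "compzter" is mapped to "computer" because "computer" occurs often, while "tihs" gets no suggestion: "this" is in the dictionary but never appears in the data, so there is nothing frequent to map to.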
 
I don't think the detailed complexTag will necessarily be of much help here, although maybe you have found a pattern that helps in your cases. The "inc" there essentially represents "unknown".

Without frequently occurring terms to map to, the algorithm won't suggest a correction. A couple of things come to mind that you might try.
1. If you find some well-edited documents (hundreds or thousands of them) to accompany your 6, you may find that some of your incorrect words do get corrected. Some of your tuning parameters will then become relevant as well. (If you want to go really extreme, you could create sentences, and ultimately documents, by randomly and repeatedly selecting dictionary words to create random content. You would want to automate the construction with a data step, because you want a large number of these documents and you want your dictionary words repeating in several of them. Then combine the new random documents with the 6 you are trying to spell check.)
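
As a rough illustration of the random-content idea, the sketch below builds documents by sampling dictionary words with replacement. It is Python rather than a data step, and the function name, parameters and the five-word dictionary are hypothetical placeholders, not part of any SAS API.

```python
import random


def make_random_documents(dictionary_words, n_docs=1000, words_per_doc=50, seed=0):
    """Build documents by sampling dictionary words with replacement."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    return [
        " ".join(rng.choices(dictionary_words, k=words_per_doc))
        for _ in range(n_docs)
    ]


# Hypothetical tiny dictionary; a real one would be the imported word list.
dictionary_words = ["this", "sentence", "needs", "story", "computer"]
docs = make_random_documents(dictionary_words, n_docs=200, words_per_doc=20)
print(len(docs), docs[0])
```

With a full dictionary of many thousands of words, uniform sampling would need a very large number of documents before each word repeats often enough, so restricting the sample to the words you care about may be more practical.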
 
And a second approach that requires more work and doesn't use the action...
2. SAS has some data step functions, such as spedis (spelling edit distance), that you could potentially use to compare every term of the offset table from tpParse with every term in your dictionary, looking for similar spellings. This means looping through every term of your collection and comparing each one to every term of your dictionary. When a term doesn't exactly match a dictionary term but is close to some other term, you could flag it. You might be surprised at how the type I and type II errors can come into play here, though.
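
The second approach can be sketched as a brute-force comparison of every term with every dictionary word. The Python below uses a plain Levenshtein distance as a stand-in; it is not the SPEDIS cost function, and the function names, the max_distance threshold and the toy dictionary are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def flag_terms(terms, dictionary, max_distance=2):
    """Return {term: [close dictionary words]} for terms not in the dictionary."""
    flagged = {}
    for term in terms:
        if term in dictionary:
            continue  # exact match, nothing to flag
        close = sorted(w for w in dictionary if edit_distance(term, w) <= max_distance)
        if close:
            flagged[term] = close
    return flagged


# Hypothetical toy dictionary and parsed terms.
dictionary = {"this", "sentence", "needs", "story", "thus"}
print(flag_terms(["tihs", "sentnce", "neeeds", "storuy", "story"], dictionary))
```

Here "tihs" is within two edits of both "this" and "thus", which is a small example of the ambiguity mentioned above: a looser threshold flags more wrong candidates, a tighter one misses real misspellings. The loop is also quadratic in the number of terms and dictionary words, so it gets slow on a full dictionary.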



dylanleong78
Fluorite | Level 6

Thanks @RussAlbright 

 

I've taken your 2nd suggestion and used a hash table to look up the terms of the parsed data set in the dictionary table, producing a data set of words that were not found in the dictionary. This doesn't provide a recommended replacement, but at least it helps identify words that are highly likely to be misspelt.
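
For readers who want the gist of that lookup, here is a minimal sketch of the same idea in Python, where a set plays the role of the hash table. This is not the code from spellcheck.sas; the function name and the sample words are made up for illustration, and the case-insensitive matching is an assumption.

```python
def words_not_in_dictionary(terms, dictionary_words):
    """Return the distinct terms with no exact (case-insensitive) dictionary match."""
    lookup = {w.lower() for w in dictionary_words}  # hash-based set for fast lookups
    missing = []
    seen = set()
    for term in terms:
        key = term.lower()
        if key not in lookup and key not in seen:
            seen.add(key)
            missing.append(term)
    return missing


# Hypothetical dictionary and parsed terms.
dictionary_words = ["this", "sentence", "needs", "a", "story"]
terms = ["Tihs", "sentnce", "neeeds", "a", "story", "tihs"]
print(words_not_in_dictionary(terms, dictionary_words))
```

One caveat with an exact lookup: if the dictionary file lists only base forms, valid inflected words may be flagged as missing too, so the result is a list of suspects rather than confirmed misspellings.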
