Hi,
I've been looking into running a spellcheck against a specific column to identify misspelt words that should be corrected.
Based on reading several articles and SAS documentation (see below), I've managed to come up with a basic solution that "sort of" works.
https://towardsdatascience.com/handling-text-data-quality-issues-can-yuo-h-andle-thsi-7f77dac3ddff
https://documentation.sas.com/doc/en/pgmsascdc/v_011/casvtapg/n1vl3yuyzcl7x4n0zc4vwynknl99.htm
As you can see from spellcheck.sas, this is my thought process:
1. download the en_GB.dic file from https://extensions.openoffice.org/en/project/english-dictionaries-apache-openoffice
2. import the file into a SAS data set
3. create a sample data set containing 6 rows of text with misspelled words in the text column
4. use tpParse to parse the sample data set and produce tpParse_out data set
5. use tpSpell to attempt to identify misspelt words and produce tpSpell_out data set
6. in tpSpell_out, the _OriginalStem_ should contain the original misspelled word whereas the _Parent_ should contain the correct word
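For anyone following along, steps 1 and 2 (getting the dictionary words into a usable lookup) could be sketched roughly as below. This is Python rather than SAS, purely to show the logic. It assumes the .dic file follows the usual Hunspell-style layout, where the first line is an entry count and each entry may carry affix flags after a slash; `load_dic_words` and the sample lines are made up for illustration and are not taken from spellcheck.sas.

```python
def load_dic_words(lines):
    """Return a set of lower-cased words from Hunspell-style .dic lines."""
    words = set()
    for i, raw in enumerate(lines):
        entry = raw.strip()
        if not entry:
            continue
        # The first line of a Hunspell-style .dic file is normally an entry count.
        if i == 0 and entry.isdigit():
            continue
        # Drop affix flags after the first slash, e.g. "computer/MS" -> "computer".
        word = entry.split("/", 1)[0].strip()
        if word:
            words.add(word.lower())
    return words


# Hypothetical sample standing in for the first few lines of en_GB.dic.
sample = ["4", "computer/MS", "this", "sentence/MS", "story/MS"]
dictionary = load_dic_words(sample)
print(sorted(dictionary))
```

Note that stripping the flags keeps only the base forms, so inflected forms that the dictionary derives through its affix rules would not be in the resulting set.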
See the TPSPELL_OUT.csv for the output generated.
However, the solution only works for misspelt words (e.g. compzter) whose correctly spelled form (e.g. computer) appears multiple times within the sample data set. If a word (e.g. tihs) appears in the sample data set only once, tpSpell cannot determine that "tihs" is actually "this", even though "this" exists in the dictionary data set. I have a number of other misspelled words in the sample data set (e.g. sentnce, neeeds, storuy).
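For comparison, a plain dictionary lookup by edit distance would catch a one-off typo like "tihs" regardless of how often the correct word appears in the data. The sketch below is Python rather than SAS and is not a description of how tpSpell works internally. It uses the optimal string alignment distance (insert, delete, substitute, or swap two adjacent characters); `osa_distance`, `suggest`, and `max_dist` are hypothetical names for this illustration and are unrelated to the tpSpell parameters, and the tiny dictionary is invented.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters, e.g. "ih" <-> "hi".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggest(word, dictionary, max_dist=2):
    """Return dictionary words within max_dist edits of word, closest first."""
    scored = [(osa_distance(word, w), w) for w in dictionary]
    return [w for dist, w in sorted(scored) if dist <= max_dist]


# Invented mini dictionary; a real run would use the words from en_GB.dic.
dictionary = {"this", "sentence", "needs", "story", "computer"}
for typo in ["tihs", "sentnce", "neeeds", "storuy", "compzter"]:
    print(typo, "->", suggest(typo, dictionary))
```

Scanning the whole dictionary for every term is slow on a full-size word list, so in practice this would only be worth running on terms already known to be missing from the dictionary.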
Other misspelt words can potentially be identified by a _ComplexTag_ value of "inc", but I have no idea what "inc" means.
I've been toying with dictPenalty, maxSpellDist, etc. to improve matching, but to no avail.
Am I barking up the wrong tree?
Thanks @RussAlbright
I've taken your 2nd suggestion and used a hash table to look up the parsed data set against the dictionary table, producing a data set of words that were not found in the dictionary. This doesn't provide a recommended replacement word, but at least it helps identify words that are highly likely to be misspelt.
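The lookup logic is roughly the following, shown here as a Python sketch rather than the actual SAS hash object code. A Python set plays the role of the hash table, and a simple regular expression stands in for the parsed terms from tpParse; the function name, the sample rows, and the mini dictionary are all made up for illustration.

```python
import re


def words_not_in_dictionary(texts, dictionary):
    """Return {word: count} for tokens that are absent from the dictionary."""
    unknown = {}
    for text in texts:
        # Crude tokenizer standing in for the parsed terms: letters and apostrophes.
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in dictionary:  # hash-based membership test
                unknown[token] = unknown.get(token, 0) + 1
    return unknown


# Invented sample data; a real run would use the full dictionary table.
dictionary = {"this", "sentence", "needs", "a", "computer", "is"}
rows = ["Tihs sentnce neeeds a computer", "This compzter is a computer"]
flagged = words_not_in_dictionary(rows, dictionary)
print(flagged)
```

Keeping a count per flagged word makes it easier to separate one-off typos from terms that are simply missing from the dictionary, such as names or product codes.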