10-21-2011 05:06 PM
I read a very interesting article today: 'How to Write a Spelling Corrector' by Peter Norvig. In it he outlines the process Google uses to do its 'Did you mean:' spelling suggestions.
Here is a link to the article: http://norvig.com/spell-correct.html
So proc spell can identify misspellings and, with the suggest option, even propose replacement words, but as far as I know it cannot replace the identified text with a new word. What I am suggesting is to take the information and algorithms outlined in Peter Norvig's article and build, ideally with proc fcmp, a version of his correct function, or some other process that suggests a proper spelling for a given string, or returns the original string if it is already a proper word.
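For reference, the core of Norvig's article is only a few lines of Python. Here is a condensed sketch of it, trained on a toy inline corpus rather than his big.txt (the CORPUS string here is an illustrative stand-in, not his data):

```python
# Condensed sketch of the corrector from norvig.com/spell-correct.html,
# trained on a toy corpus instead of big.txt.
import re
from collections import Counter

def words(text):
    return re.findall(r'[a-z]+', text.lower())

# Toy corpus; Norvig trains on roughly a million words from big.txt.
CORPUS = "spelling is hard but spelling errors in spelling can be corrected"
WORDS = Counter(words(CORPUS))
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in ALPHABET if b]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def known(candidates):
    return {w for w in candidates if w in WORDS}

def correct(word):
    # Prefer the word itself, then known 1-edit words, then 2-edit words,
    # breaking ties by corpus frequency (the P(c) term in Norvig's model).
    candidates = (known([word]) or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=WORDS.get)
```

A SAS version would need to reproduce the edits1 candidate generation and the frequency table, which is what proc fcmp (or a data step with a hash of word counts) would have to supply.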
10-21-2011 06:07 PM
Data Null: Did you mean: Punctuation Korrector?
FriedEgg: In spite of the fact that Proc Spell is no longer documented (although I do have a copy of the original documentation), I think it is an excellent suggestion and might just be enough to get SAS to re-document proc Spell.
10-24-2011 05:55 AM
I am very interested in this.
Bayes' theorem is a very old piece of statistical theory; some statisticians approve of it while others oppose it.
Fortunately, SAS already offers the spedis function to measure spelling distance.
So with the help of a hash table, I think this will be easy to achieve in SAS.
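The distance-plus-dictionary-scan idea can be sketched outside SAS as well. In this Python sketch (Python being the language of Norvig's article), the levenshtein function plays the role of complev, and the toy DICTIONARY list is just an illustrative stand-in for a real word list:

```python
# Sketch of the dictionary-scan idea: compute an edit distance (the role
# COMPLEV plays in SAS) against every dictionary word and keep the nearest.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def nearest(word, dictionary):
    """Return the dictionary word with the smallest edit distance."""
    return min(dictionary, key=lambda w: levenshtein(word, w))

# Toy dictionary for illustration.
DICTIONARY = ['corrector', 'collector', 'connector', 'correct']
```

In SAS the same scan is the hash-iterator loop over the dictionary dataset shown later in this thread.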
10-24-2011 09:20 AM
Of course, two problems/limitations that would have to be overcome regarding proc spell are that (1) the documentation doesn't provide any clue (at least that I could find) about how to access the dictionary, and (2) it doesn't provide a way (at least that I could find) of capturing its output.
10-25-2011 09:37 AM
Why do you think it wouldn't work? I would think that, given a string that isn't an exact match for any of the entries in a dictionary, comparing that string against all of the other strings in the dictionary with spedis, complev, and/or compged would have an excellent chance of solving the problem.
10-25-2011 11:42 PM
I found complev() works much better. The dictionary is the one offered by FriedEgg.
But I am still interested in Peter's algorithm based on Bayes' theorem.
filename x 'c:\unix-words';

data dictionary;
 infile x truncover;
 input word : $20.;
run;

%let input_word=Korrecter;

data _null_;
 if 0 then set dictionary;
 declare hash ha(hashexp:20,dataset:'work.dictionary');
 declare hiter hi('ha');
 ha.definekey('word');
 ha.definedata('word');
 ha.definedone();
 d=10000;
 do while(hi.next()=0);
  dd=complev("&input_word",word,'i');
  if d gt dd then do;
   d=dd;
   want_word=word;
  end;
 end;
 put 'WORD: ' "&input_word" +2 'Did you mean: ' want_word;
 stop;
run;
10-28-2011 01:50 PM
Here is what I have at this point. It is not exactly what Peter Norvig does, but it is where I am so far.
* Based on : http://norvig.com/spell-correct.html;
%let word=speling;
filename big '/temp/big.txt';             * http://norvig.com/big.txt;
filename words '/usr/share/dict/words';   * unix default dictionary (provided in my other word puzzle related posts);

data words;
 infile words truncover;
 input word $upcase48.;
run;

data big;
 length word $48;
 infile big lrecl=1024 truncover;
 input @;
 _infile_=compbl(prxchange('s/[^A-Z]/ /i',-1,_infile_));
 if _infile_ ne '' then do i=1 to countw(_infile_,' ');
  word=upcase(scan(_infile_,i,' '));
  if word ne '' then output;
 end;
 drop i;
run;

data words;
 set words big;
 word=strip(word);
run;

proc freq data=words;
 tables word /list out=wfreq(drop=percent) noprint;
run;

%macro wf_find; *to avoid repeating this code block below for each correction type;
 if wf.find()=0 then do;
  clev=complev(orig_word,word);
  if clev<=2 then output;
 end;
%mend;

data corrections;
 length word a b c $48;
 orig_word=upcase("&word");
 alphabet='ABCDEFGHIJKLMNOPQRSTUVWXYZ';
 if 0 then set wfreq;
 declare hash wf(hashexp:10,dataset:'wfreq');
 declare hiter wfi('wf');
 wf.definekey('word');
 wf.definedata(all:'Y');
 wf.definedone();
 *replaces;
 do i=1 to length(orig_word);
  do ii=1 to 26;
   word=orig_word;
   substr(word,i,1)=substr(alphabet,ii,1);
   %wf_find
  end;
 end;
 *deletes;
 do i=1 to length(word);
  word=orig_word;
  substr(word,i,1)='';
  word=compress(word);
  %wf_find
 end;
 *transposes;
 do i=1 to length(orig_word)-1;
  word=orig_word;
  a=substr(word,i,1);
  b=substr(word,i+1,1);
  substr(word,i,1)=b;
  substr(word,i+1,1)=a;
  %wf_find
 end;
 *inserts;
 do i=0 to length(orig_word);
  word=orig_word;
  a=subpad(word,1,i);
  b=subpad(word,i+1,length(word)-i);
  do ii=1 to 26;
   c=substr(alphabet,ii,1);
   word=cats(of a c b);
   %wf_find
  end;
 end;
 *brute - find all words in 'dictionary' that have an edit distance of <= 2, this step should not be necessary because previous method should find all instances, however this is just to be sure;
 do while(wfi.next()=0);
  clev=complev(orig_word,word);
  if clev<=2 then output;
 end;
 keep orig_word word count clev;
 stop;
run;

proc sql;
 select distinct 'Did you mean: ' || strip(word)
 from corrections
 where clev=(select min(clev) from corrections)
 and count=(select max(count) from corrections
            where clev=(select min(clev) from corrections));
quit;
10-30-2011 10:55 PM
Peter's Bayes method is simple if you can use SAS's complev().
Using a hash table makes it very easy, with less code.
Of course I can code it based on Peter's Bayes approach; that is easy.
But what concerns me is how to write a function like complev() to measure edit distance myself.
That is the point, because in the end Peter did not actually use the probability from the Bayes formula.
Bayes' theorem is simple, but the calculation is very complicated.
11-01-2011 04:09 PM
The article by Norvig does depart slightly from truly calculating the Bayesian probability. Instead it implements a sort of logical replacement: take the probability of the correction (the shortest edit distance) together with the frequency of the corrected word in the dictionary (big.txt). The best candidate is the word with the shortest edit distance and the highest frequency. This is definitely not calculating the probability, but it follows the logic of what the formula accomplishes, as Peter explains. He also covers a whole array of issues this method has in properly identifying corrections. In a different article, whose source I can no longer remember, I read that Google's dictionary uses over 10 trillion 4-word strings to aid in identifying spelling corrections (because surrounding words help choose the right one). Here is an example of the issue with this method.
I mean to spell 'THEY' but accidentally type THAY.
filename big '/nas/sasbox/users/mkastin/big.txt';
length word $48;
infile big lrecl=1024 truncover;
if not missing(_infile_) then
do i=1 to countw(_infile_,' ');
if word ne '' then output;
proc freq data=big;
tables word /list out=wfreq(drop=percent) noprint;
if 0 then set wfreq;
declare hash wf(hashexp:10,dataset:'wfreq');
declare hiter wfi('wf');
if clev<=2 then output;
keep orig_word word count clev;
proc sql noprint;
select min(clev) into :min_clev from corrections;
select max(count) into :max_count from corrections where clev=&min_clev;
select distinct 'Did you mean: ' || strip(word)
from corrections
where clev=&min_clev and count=&max_count;
quit;
Did you mean: THAT
no, I meant 'THEY'... However, if you look at the data (here are my choices with the shortest edit distance, 1):
WORD   COUNT  orig_word  clev
HAY       42  THAY          1
THAW       2  THAY          1
THA        1  THAY          1
THAT   12423  THAY          1
THAN    1199  THAY          1
TRAY       8  THAY          1
THY       47  THAY          1
THEY    3932  THAY          1
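The tie-breaking that produces THAT can be written out directly. In this Python sketch the counts are the ones from the table above; the ranking is edit distance first, then frequency (a sketch of the logic, not Norvig's actual code):

```python
# Rank candidates: smallest edit distance first, largest count second.
# Counts are the big.txt frequencies from the table for the typo 'THAY'.
CANDIDATES = {  # word: (count, edit distance from 'THAY')
    'HAY': (42, 1), 'THAW': (2, 1), 'THA': (1, 1), 'THAT': (12423, 1),
    'THAN': (1199, 1), 'TRAY': (8, 1), 'THY': (47, 1), 'THEY': (3932, 1),
}

def best(cands):
    return min(cands, key=lambda w: (cands[w][1], -cands[w][0]))

print('Did you mean:', best(CANDIDATES))
```

Because every candidate is exactly one edit away, frequency decides the tie, and THAT (count 12423) beats the intended THEY (count 3932). Only context, such as the surrounding-word n-grams mentioned above, could prefer THEY here.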
11-02-2011 04:09 AM
I know the Bayes formula very well.
As I said before, I agree with it in some situations and disagree in others, like other statisticians.
For this case, the Bayes calculation is simple because the variable's distribution is discrete, not continuous.
If it were continuous, you would need multiple integrals to get the probability, which is horrible.
As far as I know, Bayes' theory is highly regarded in the IT field.
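For the discrete case described here, the full Bayes calculation really is just a few multiplications. In this sketch the priors come from the big.txt counts quoted earlier in the thread, while the likelihood numbers are invented purely for illustration (a real corrector would need an actual error model):

```python
# Toy discrete Bayes: P(c | w) is proportional to P(w | c) * P(c).
# Priors P(c) come from word counts (as in Norvig's model); the
# likelihoods P(w | c) below are made-up illustrative numbers.
counts = {'THAT': 12423, 'THEY': 3932, 'THAN': 1199}
total = sum(counts.values())
prior = {c: n / total for c, n in counts.items()}

# Hypothetical error model: probability of typing 'THAY' given intent c
# (A and E are an easy keyboard slip, so THEY gets a higher likelihood).
likelihood = {'THAT': 0.001, 'THEY': 0.010, 'THAN': 0.001}

posterior = {c: likelihood[c] * prior[c] for c in counts}
best = max(posterior, key=posterior.get)
```

With an error model that knows about the A/E slip, the posterior picks THEY even though THAT has the much larger prior, which is exactly the correction the frequency-only ranking missed.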
11-03-2011 04:12 PM
I happened to be looking at the seminars for the next SAS Global Forum, and guess what one of the pre-conference statistical tutorials is: Introduction to Bayesian Analysis Using SAS Software. A humorous coincidence.