Norvig's article does depart slightly from truly calculating the Bayesian probability. Instead it implements a sort of logical replacement: combine the probability of the correction (approximated by the shortest edit distance) with the frequency of the corrected word in our dictionary (big.txt). The best candidate is the one with the shortest edit distance and the highest frequency. This is definitely not calculating the probability, but it follows the logic of what the formula is accomplishing, or so Peter argues. He also goes over a wide range of cases where this approach fails to identify the proper correction.

In a different article, whose source I can no longer remember, I read that Google uses over 10 trillion 4-word strings in its dictionary to aid in identifying spelling corrections (because the surrounding words help with the correction).

Here is an example of the issue with this method. I mean to spell 'THEY' but I accidentally type THAY.

    %let word=thay;

    filename big '/nas/sasbox/users/mkastin/big.txt';

    /* Split big.txt into one upper-cased word per observation */
    data big;
      length word $48;
      infile big lrecl=1024 truncover;
      input @;
      /* Replace every non-letter with a blank, then collapse repeated blanks */
      _infile_=compbl(prxchange('s/[^A-Z]/ /i',-1,_infile_));
      if not missing(_infile_) then do i=1 to countw(_infile_,' ');
        word=upcase(scan(_infile_,i,' '));
        if word ne '' then output;
      end;
      drop i;
    run;

    /* Word frequencies: WFREQ has one row per word with its COUNT */
    proc freq data=big;
      tables word /list out=wfreq(drop=percent) noprint;
    run;

    /* Keep every dictionary word within edit distance 2 of the typed word */
    data corrections;
      if 0 then set wfreq;   /* defines WORD and COUNT without reading data */
      declare hash wf(hashexp:10,dataset:'wfreq');
      declare hiter wfi('wf');
      wf.definekey('word');
      wf.definedata(all:'Y');
      wf.definedone();
      orig_word=upcase("&word");
      do while(wfi.next()=0);
        clev=complev(orig_word,word);   /* Levenshtein edit distance */
        if clev<=2 then output;
      end;
      keep orig_word word count clev;
      stop;
    run;

    /* Shortest edit distance, then the highest count at that distance */
    proc sql noprint;
      select min(clev) into :min_clev from corrections;
      select max(count) into :max_count from corrections where clev=&min_clev;
    quit;

    proc sql;
      select distinct 'Did you mean: ' || strip(word)
        from corrections
        where clev=&min_clev and count=&max_count;
    quit;

Output:

    Did you mean: THAT

No, I meant 'THEY'... However, if you look at the data (here are my choices with the shortest edit distance, 1):

    WORD    COUNT   orig_word   clev
    HAY        42   THAY           1
    THAW        2   THAY           1
    THA         1   THAY           1
    THAT    12423   THAY           1
    THAN     1199   THAY           1
    TRAY        8   THAY           1
    THY        47   THAY           1
    THEY     3932   THAY           1
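
A listing like the one above can be pulled from the CORRECTIONS data set with a query along these lines. This is only a minimal sketch: it assumes CORRECTIONS and the &min_clev macro variable created by the earlier steps are still available in the session, and the ORDER BY is my own addition to make the ranking easier to read, so the rows will come out in a different order than shown above.

```sas
/* Sketch: list every candidate at the shortest edit distance,      */
/* most frequent first. Assumes CORRECTIONS and &min_clev from the  */
/* earlier steps still exist in the current session.                */
proc sql;
  select word, count, orig_word, clev
    from corrections
    where clev = &min_clev
    order by count desc;   /* assumption: sort added for readability */
quit;
```

With the counts shown above, THAT (12423) sorts first and THEY (3932) second, which is exactly why picking the highest count at the shortest distance returns THAT.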