DATA Step, Macro, Functions and more

Speling Korrecter

Reply
Trusted Advisor
Posts: 1,300

Speling Korrecter

I read a very interesting article today.  'How to Write a Spelling Corrector" by Peter Norvig.  In this article he outlines the process Google uses to to do it's 'Did you mean:' new spelling suggestions.

Here is a link to the article: http://norvig.com/spell-correct.html

So proc spell can identify misspelling and even provide suggestions of new words if you use the suggest option but it cannot, as far as I know, replace identified text with a new word.  So what I am suggesting is that taking the vast information and algorithms outlined in the paper by Peter Norvig and you create, ideally a version or the correct function using proc fcmp, or another process by which to suggest a proper spelling to a given string or return the original string if it is a proper word.

Respected Advisor
Posts: 3,777

Speling Korrecter

Did you mean: Spelling Corrector

:smileydevil:

PROC Star
Posts: 7,360

Speling Korrecter

Data Null: Did you mean: Punctuation Korrector?

FriedEgg:  In spite of the fact that Proc Spell is no longer documented (although I do have a copy of the original documentation), I think it is an excellent suggestion and might just be enough to get SAS to re-document proc Spell.

Super User
Posts: 9,674

Speling Korrecter

I am very interested with it.

Bayes' Theorem is a very old statistical theory.

Some statistician approve ,while some oppose .

But Fortunately, SAS has already offer a function spedis to measure the distance of pronunciation.

So with the help of Hash Table, I think it will be easy to achieve in SAS.

Ksharp

PROC Star
Posts: 7,360

Speling Korrecter

Of course two problems/limitations that would have to be overcome regarding proc spell is that (1) the documentation doesn't provide any clue (at least that I could find) regarding how to access the dictionary and (2) it doesn't provide a way (at least that I could find) of capturing its output.

Trusted Advisor
Posts: 1,300

Speling Korrecter

Yes, the elusive sashelp.base.master.dictnary, I cannot figure out how to utilize this catalog either...  It seems to maybe be stored in a similar method to stored format, maybe.

Super User
Posts: 9,674

Speling Korrecter

I am so frustrated that spedis() looks like can not work for this situation.

is there someone intend to try it as Peter's algorithm.

PROC Star
Posts: 7,360

Speling Korrecter

Why do you think it wouldn't work?  I would think that given a string that isn't an exact match any of the entries in a dictionary, that using spedis, complev and or compged comparing that string with all of the other strings in the dictionary would have an excellent chance of solving the problem.

Trusted Advisor
Posts: 1,300

Speling Korrecter

The method in which I am trying this currently uses compged because I am looking only for words with a maximum edit distance of 2 (according to the article).

Super User
Posts: 9,674

Re: Speling Korrecter

I found complev() is much better. The dictionary is that offered by FriedEgg.

But I am still interested with Peter's algorithm - Bayes' Theorem.

filename x 'c:\unix-words';
data dictionary;
 infile x truncover;
 input word : $20.;
run;

%let input_word=Korrecter;
data _null_;
 if 0 then set dictionary;
 declare hash ha(hashexp:20,dataset : 'work.dictionary');
 declare hiter hi('ha');
  ha.definekey('word');
  ha.definedata('word');
  ha.definedone();
d=10000;
do while(hi.next()=0);
 dd=complev("&input_word",word,'i');
 if d gt dd then do;
                   d=dd;
                   want_word=word;
                 end;
end; 
put 'WORD: ' "&input_word" +2 'Did you mean: ' want_word;
stop;
run;
 
 

Ksharp

Trusted Advisor
Posts: 1,300

Re: Speling Korrecter

Here it what I have at this point.  It is not exactly what is being done by Peter Norvig, but it is where I am so far.

* Based on : http://norvig.com/spell-correct.html;
 
%let word=speling;
 
filename big '/temp/big.txt'; * http://norvig.com/big.txt;
filename words '/usr/share/dict/words'; * unix default dictionary (provided in my other word puzzle related posts);
 
data words;
 infile words truncover;
 input word $upcase48.;
run;
 
data big;
 length word $48;
 infile big lrecl=1024 truncover;
 input @;
 _infile_=compbl(prxchange('s/[^A-Z]/ /i',-1,_infile_));
 if _infile_ ne '' then
  do i=1 to countw(_infile_,' ');
   word=upcase(scan(_infile_,i,' '));
   if word ne '' then output;
  end;
 drop i;
run;
 
data words;
 set words big;
 word=strip(word);
run;
 
proc freq data=words;
 tables word /list out=wfreq(drop=percent) noprint;
run;
 
%macro wf_find; *to avoid repeating this code block below for each correction type;
 if wf.find()=0 then
  do;
   clev=complev(orig_word,word);
   if clev<=2 then output;
  end;
%mend;
 
data corrections;
 length word a b c $48;
 orig_word=upcase("&word");
 alphabet='ABCDEFGHIJKLMNOPQRSTUVWXYZ';
 
 if 0 then set wfreq;
  declare hash wf(hashexp:10,dataset:'wfreq');
  declare hiter wfi('wf');
   wf.definekey('word');
   wf.definedata(all:'Y');
   wf.definedone();
 
 *replaces;
 do i=1 to length(orig_word);
  do ii=1 to 26;
   word=orig_word;
   substr(word,i,1)=substr(alphabet,ii,1);
   %wf_find
  end;
 end;
 *deletes;
 do i=1 to length(word);
  word=orig_word;
  substr(word,i,1)='';
  word=compress(word);
  %wf_find
 end;
 *transposes;
 do i=1 to length(orig_word)-1;
  word=orig_word;
  a=substr(word,i,1);
  b=substr(word,i+1,1);
  substr(word,i,1)=b;
  substr(word,i+1,1)=a;
  %wf_find
 end;
 *inserts;
 do i=0 to length(orig_word);
   word=orig_word;
   a=subpad(word,1,i);
   b=subpad(word,i+1,length(word)-i);
  do ii=1 to 26;
   c=substr(alphabet,ii,1);
   word=cats(of a c b);
   %wf_find
  end;
 end;
 *brute - find all words in 'dictionary' that have an edit distance of <= 2, this step should not be necessary because previous method should find all instances, however this is just to be sure;
 do while(wfi.next()=0);
  clev=complev(orig_word,word);
  if clev<=2 then output;
 end;
 keep orig_word word count clev;
 stop;
run;
 
proc sql;
 select distinct 'Did you mean: ' || strip(word)
   from corrections
  where clev=( select min(clev) 
                 from corrections ) 
        and count=( select max(count) 
                      from corrections 
                     where clev=( select min(clev) 
                                    from corrections ));
quit;

Super User
Posts: 9,674

Re: Speling Korrecter

Hi. FriedEgg.

Peter's Bayes method is simple, if you can use complev() of SAS.

Using Hash Table will be very easy and less code.

Of course. I can code it based on Peter's Bayes.It is easy.

But What I am concerned is how make a function complev() to test the distance of edit by myself.

That is the point. Because Peter did not use the Probability of Bayes Formula FINALLY.

Bayes's Theory is simple but calculation is very complicated.

Ksharp

Trusted Advisor
Posts: 1,300

Speling Korrecter

The article by Norvig does slightly depart from the method of truely calculating the bayesian probability.  Instead it implements a sort of logical replacement...  Take the probability of the correction (the shortest edit distance) with the frequency of appearance of the corrected word in our dictionary (big.txt).  The best probability will be where the correct word has the shortest edit distance and the highest appearance frequency.  This is definitly not calculating the probability, but follows the logic of what the formula is accomplishing, or so Peter departs.  He also goes over a vast array of issues that this does have in properly identifying corrections.  In a different article, whose source I can no longer remember, I read that at google in their dictionary they use over 10 trillion 4 word strings in their dictionary to aid in the proper identification of spelling corrections (because surrounding words aid in the correction).  Here is a example of the issue with this method.

I am meaning to spell 'THEY' but I acctidently type THAY

%let word=thay;

filename big '/nas/sasbox/users/mkastin/big.txt';

data big;

length word $48;

infile big lrecl=1024 truncover;

input @;

  _infile_=compbl(prxchange('s/[^A-Z]/ /i',-1,_infile_));

if not missing(_infile_) then

  do i=1 to countw(_infile_,' ');

   word=upcase(scan(_infile_,i,' '));

   if word ne '' then output;

  end;

drop i;

run;

proc freq data=big;

tables word /list out=wfreq(drop=percent) noprint;

run;

data corrections;

if 0 then set wfreq;

declare hash wf(hashexp:10,dataset:'wfreq');

declare hiter wfi('wf');

  wf.definekey('word');

  wf.definedata(all:'Y');

  wf.definedone();

orig_word=upcase("&word");

do while(wfi.next()=0);

  clev=complev(orig_word,word);

  if clev<=2 then output;

end;

keep orig_word word count clev;

stop;

run;

proc sql noprint;

select min(clev) into :min_clev from corrections;

select max(count) into :max_count from corrections where clev=&min_clev;

quit;

proc sql;

select distinct 'Did you mean: ' || strip(word)

   from corrections

  where clev=&min_clev and count=&max_count;

quit;

Did you mean: THAT

no, I meant 'THEY'...  However, if you look at the data (here are my choices with the shortest edit distance, 1):

WORD COUNT orig_word clev

HAY          42          THAY          1

THAW          2          THAY          1

THA          1          THAY          1

THAT          12423          THAY          1

THAN          1199          THAY          1

TRAY          8          THAY          1

THY          47          THAY          1

THEY          3932          THAY          1

Super User
Posts: 9,674

Speling Korrecter

Hi.

I know Bayes Formula very well.

As I said before , I agree with it at sometime ,but disagree with it at another sometime. Like other statistician.

For this case. Bayes 's calculation is simple because the distribution of variable's  value is disperse not continue.

If it were continuous ,then you need to use multiple integral to get its probability. That is very horrible.

As far as I know Bayes Theory is very complimented at IT field.

Ksharp

Trusted Advisor
Posts: 1,300

Speling Korrecter

I happened to be looking at the seminars for the next SAS Global Forum and guess what one of the pre-conference statistical tutorials is...  Introduction to Bayesian Analysis Using SAS Software.  Humerous coincidence.

Ask a Question
Discussion stats
  • 15 replies
  • 562 views
  • 0 likes
  • 4 in conversation