- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
The typo mistake is driving me crazy. As the sample dataset is shown below, it's very easy to find out one name by misspelling one or two letters. Is there a way to put them into one name? For example, there are five spelling in 'LARYNGOMALACIA' and four spelling in 'DERMATITIS'. Could SAS figure out all of them means one thing?
data datain1; infile datalines delimiter=','; input Name : $300. ; datalines; ALLERGIC RHINIT, ALLERGIC RHINITIS, ACUTE OTITIS ME, ACUTE OTITIS MEDIA, AIRWAY CLEARANCE IMPAIRMENT, AIRWAY CLEARENCE IMPAIRMENT, ATOPIC DERMATITIS, ATOPIC DERMATITS, ATOPIC DERMATITUS, ATOPIC DERMITITS, ECZEMA, ECZEMAN, EOSINOPHILIC ESOPHAGITIS, EOSINOPHILIC ESOPHAGITIS, EOSINOPHILIC ESOPHAITIS, IMPERFORATE ANUS, IMPERFORATED ANUS, LARYNGOMACLACIA, LARYNGOMALACIA, LARYNGOMALARIA, LARYNOMALACIA, LAYNGOMALACIA, SUBGLOTTIC STENOSIS, SUBLGOTTIC STENOSIS ; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You could try the SOUNDS LIKE operator, and see if that gets the job done.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
u could use different string matching algorithms such as comged, complev, soundex, spedis etc. and come up with a threshold to say all those are the same. However you need to have a standard dictionary to refer against.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you are not going to have to check the entire medical encyclopedia for terms this might give you one starting point:
proc sql; create table spelltest as select distinct a.name as NameA, b.name as NameB ,compged(a.name,b.name) as spelldif ,' ' as useA,' ' as useb from datain1 as a, datain1 as b where a.name lt b.name and compged(a.name,b.name) < 600 ; quit;
the UseA and UseB are to look at your results where the names aren't the same and pick which is the preferred versiob, unfortunately manually. If you had a known list of proper values that could be determined from membership in that data. Viewtable will open data sets in Edit mode to enter values if used properly.
Then use that marked data with a data step to create a format cntlin data set where "start" is the bad value (not selected in UseA or UseB) and "label" is the UseA or UseB selected value.
Then create the format and test by creating a new variable using the format applied to your existing names.
newname = put(name,Fixnamefmt.); for example.
This should only change the bad values.
The threshold I used for the Gompged function was just a guess. The example you show doesn't have any values of more than 200 for the spelling difference. The a.name <b.name was to prevent getting each pair duplicated, one as the A version and then as the B.Version. Normally I don't recommend < or > with character values but that was the quickest way I could think of to get single pairs of values.
But this may take a LONG time if you have 1000s of values.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Are you responsible for typing that information, either by yourself or someone is doing it for you?
If positive, may I suggest to hold a table/format with a code and the correct spelling. While typing the data - type the code and just three first characters for control.
importing the file, if the formatted value does not start with the 3 typed characters then that line will be rejected.
Of course, you may choose for control 4 characters or any other short abbreviation.
As for using any of the functions mentioned by @smantha , I suggest to use COMLEV function and check for diff le 5. The syntax is like in:
diff = complev(a,b);
where a is in the input, b is the correct spelling.
For a more efficient code, please post a sample of the data. You probably have hundred(s) of names and it will be not applicable to use any of those functions, as you don't know to what correct name it should be compared. Therefor I come back to my first proposal.