SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
ybz12003
Rhodochrosite | Level 12

Hello,

The typo mistake is driving me crazy.   As the sample dataset is shown below, it's very easy to find out one name by misspelling one or two letters.   Is there a way to put them into one name?   For example, there are five spelling in 'LARYNGOMALACIA' and four spelling in 'DERMATITIS'.   Could SAS figure out all of them means one thing?

 

data datain1;
      infile datalines delimiter=',';
  input Name : $300.  ;
datalines;
	ALLERGIC RHINIT,
	ALLERGIC RHINITIS,
	ACUTE OTITIS ME,
	ACUTE OTITIS MEDIA,
	AIRWAY CLEARANCE IMPAIRMENT,
	AIRWAY CLEARENCE IMPAIRMENT,
	ATOPIC DERMATITIS,
	ATOPIC DERMATITS,
	ATOPIC DERMATITUS,
	ATOPIC DERMITITS,
	ECZEMA,
	ECZEMAN,
	EOSINOPHILIC ESOPHAGITIS,
	EOSINOPHILIC ESOPHAGITIS,
	EOSINOPHILIC ESOPHAITIS,
	IMPERFORATE ANUS,
	IMPERFORATED ANUS,
	LARYNGOMACLACIA,
	LARYNGOMALACIA,
	LARYNGOMALARIA,
	LARYNOMALACIA,
	LAYNGOMALACIA,
	SUBGLOTTIC STENOSIS,
	SUBLGOTTIC STENOSIS
;
run;
4 REPLIES 4
PaigeMiller
Diamond | Level 26

You could try the SOUNDS LIKE operator, and see if that gets the job done.

 

https://documentation.sas.com/?docsetId=sqlproc&docsetTarget=p0a62rd151ctown1x38ihdpjozyv.htm&docset...

--
Paige Miller
smantha
Lapis Lazuli | Level 10

u could use different string matching algorithms such as comged, complev, soundex, spedis etc. and come up with a threshold to say all those are the same. However you need to have a standard dictionary to refer against.

ballardw
Super User

If you are not going to have to check the entire medical encyclopedia for terms this might give you one starting point:

proc sql;
   create table spelltest as
   select distinct a.name as NameA, b.name as NameB
     ,compged(a.name,b.name) as spelldif
     ,' ' as useA,' ' as useb
   from datain1 as a, datain1 as b
   where a.name lt b.name and 
        compged(a.name,b.name) < 600
   ;
quit;

the UseA and UseB are to look at your results where the names aren't the same and pick which is the preferred versiob, unfortunately manually. If you had a known list of proper values that could be determined from membership in that data. Viewtable will open data sets in Edit mode to enter values if used properly.

 

Then use that marked data with a data step to create a format cntlin data set where "start" is the bad value (not selected in UseA or UseB) and "label" is the UseA or UseB selected value.

Then create the format and test by creating a new variable using the format applied to your existing names.

newname = put(name,Fixnamefmt.); for example.

This should only change the bad values.

 

The threshold I used for the Gompged function was just a guess. The example you show doesn't have any values of more than 200 for the spelling difference. The a.name <b.name was to prevent getting each pair duplicated, one as the A version and then as the B.Version. Normally I don't recommend < or > with character values but that was the quickest way I could think of to get single pairs of values.

But this may take a LONG time if you have 1000s of values.

Shmuel
Garnet | Level 18

Are you responsible for typing that information, either by yourself or someone is doing it for you?

If positive, may I suggest to hold a table/format with a code and the correct spelling. While typing the data - type the code and just three first characters for control.

 

importing the file, if the formatted value does not start with the 3 typed characters then that line will be rejected.

 

Of course, you may choose for control 4 characters or any other short abbreviation.

 

As for using any of the functions mentioned by @smantha , I suggest to use COMLEV function and check for diff le 5. The syntax is like in:

diff = complev(a,b);

where a is in the input, b is the correct spelling.

 

For a more efficient code, please post a sample of the data. You probably have hundred(s) of names and it will be not applicable to use any of those functions, as you don't know to what correct name it should be compared. Therefor I come back to my first proposal.

sas-innovate-white.png

Our biggest data and AI event of the year.

Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.

Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1071 views
  • 0 likes
  • 5 in conversation