BookmarkSubscribeRSS Feed
dkeuch816
Calcite | Level 5

Hi all,

 

I am importing data over from an excel sheet into Sas EG. I have to create a new data set from that data where there are no mispellings, extra spaces, etc. I have a Character Variable, Subject, where some of it's observations are mispelled and i need to fix them. please help with the correct data step procedure.

7 REPLIES 7
ballardw
Super User

Not a trivial exercise and made harder by not providing any example data.

 

What types of values do you have? People names (misspelling almost impossible to address), company names, City names, drugs, machinery ?

 

Do you know what the correct spellings should be for things? Are the misspellings in the middle of other text such as the middle of a sentence?

 

If you have single values that is misspelled such as "BIM" instead of "IBM" then there are tests that can be done to determine if the word is similar to another word and if similar enough you might replace it with a value from a known list. But we would kind of need to have some examples of the misspelled value and the correct value.

 

Of if feeling brave Excel does have a spell checker with autocorrect built in. Try that before importing.

dkeuch816
Calcite | Level 5
I need to change acconting to accounting and mth to match. Its a dataset of teachers and subject is a table implying the subject they teach. I tried
If subject=acconting then subject=accounting. All that did was create a new variable and list all values as missing.
Reeza
Super User
Note the quotes in my code. You can run that by the way, sashelp.class hould exist on your system.
ballardw
Super User

Here is an alternate approach using a format that will do the same thing that @Reeza's example does.

proc format library=work;
value $rename
'Alfred' = 'Alfie'
'John'   = 'Johnny'
;
run;

data example;
   set sashelp.class;
   name = put(name,$rename.);
run;

The format used this way can seriously reduce the number of IF/THEN/ELSE or other logic statements. And if you are doing this repeatedly with multiple data sets such that new errors appear they can be easily added to the format.

Suppose I have developed a format that works for your example for accounting which could look like:

proc format library=work;
value $rename
'acconting' = 'accounting'
;
run;

and then the next time a file needs to be processed you find a value of 'acounting'. Just add it to the format:

proc format library=work;
value $rename
'acconting', 'acounting' = 'accounting'
;
run;

No additional code changes need be made.

This format approach works because any value that does not appear on the left of an = sign in the value block is not reformatted. So the correct values are left alone.

 

PS: From the description I would suspect that you might be changing "mth" to "math" not "match".

 

And you can get a reasonable start on the values that go on the left by using code like:

 

Proc sql;
    select distinct subject
    from yourdataset
    ;
quit;

Which will create a list in the results window of all the values of subject that you can copy and paste into the editor for proc format. Remove the correct spellings. The value lists in proc format do not have to be all on one line so making a value statement like this should be moderately easy.

proc format library=work;
value $rename
'acconting' , 
'acounting' = 'accounting'
'matth' ,
'mth'    = 'math'
;
run;

 

dkeuch816
Calcite | Level 5
Thank you so much that worked!!
Reeza
Super User

These are usually brute force unfortunately, unless you have systematic rules that need to be applied. 

 

 

 

data clean;
set sashelp.class;

if name = 'Alfred' then name = 'Alfie';
else if name = 'John' then name = 'Johnny';

run;

 

 


@dkeuch816 wrote:

Hi all,

 

I am importing data over from an excel sheet into Sas EG. I have to create a new data set from that data where there are no mispellings, extra spaces, etc. I have a Character Variable, Subject, where some of it's observations are mispelled and i need to fix them. please help with the correct data step procedure.




ScottBass
Rhodochrosite | Level 12

If your data volume is small enough, what I would do is:

 

* Get a distinct list of your subjects (either proc sort nodupkey or proc sql select distinct)

* Review the distinct list of data.

* For misspelled data, create a column such as correct_spelling with the correct spelling.

 

You've now just created your custom dictionary.

 

With each new data feed, append to your custom dictionary.

 

Rejoin back to your data, assigning correct_spelling to subject where correct_spelling is not missing.

 

The compbl function can help with multiple spaces:

 

data test;
   have='This is  a    string with   multiple        spaces';
   want=compbl(have);
run;

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1714 views
  • 2 likes
  • 4 in conversation