BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
shortyofhb
Calcite | Level 5

Hello all,

I have a data cleaning question. I was given an external dataset that I imported into SAS. One of the tasks is to clean the dataset and rename the variables.

How would I go about cleaning a variable that has multiple different values. For example, my gender variable has Female, Femalee, Fame, emale, etc and Male, Malee, ale, Maale, etc.

 

I initially tried to use an If, Then statement but didn't know how to do it since not all the values start with M or F.

1 ACCEPTED SOLUTION

Accepted Solutions
Shmuel
Garnet | Level 18

Few suggestions to think about:

Step 1 - compress any pair of sequence letters (like 'aa' or 'ee' etc.) to a single one ('a' or 'e' etc.).

Step2 - 

If 1st letter is capital 'F' then it is Female, or if it is 'M' then it is Male otherwise:

if length(gender) > 4 then it is Female else it is unknown, because 'ale' can be both.

 

If you check all unknown gender you may find some more rules to decide is it a Male or Female. (for example 'eminin' as part of 'Feminine' means 'Female' or 'Nale' is 'Male' as 'N' and 'M' are near letters in QWERT keboard);

 

View solution in original post

8 REPLIES 8
PeterClemmensen
Tourmaline | Level 20

Let's start with some logic here. What if the value is "mal"? Do you want it to be Make or Female? Could be both?

shortyofhb
Calcite | Level 5

I think if it is just "mal" I would like to make it "Male"

PeterClemmensen
Tourmaline | Level 20

Here is a brute force mehtod. See if you can use that as a template.

 

data have;
input gender $;
datalines;
Female  
Femalee 
Fame    
emale   
Male    
Malee   
ale     
Maale   
;

data want;
   set have;
   if      gender in ("Female", "Femalee", "Fame", "emale") then newgender = "Female";
   else if gender in ("Male", "Malee", "ale", "Maale") then newgender = "Male";
run;
shortyofhb
Calcite | Level 5

Its a huge dataset, but if I run a proc freq it should show me the different values and then I could input those that pop up right?

shortyofhb
Calcite | Level 5

I have to do the same thing but with cities, do you know how I would incorporate propcase into it?

Would I do the same, and then just make a new function that propases the new variable name?

Shmuel
Garnet | Level 18

Few suggestions to think about:

Step 1 - compress any pair of sequence letters (like 'aa' or 'ee' etc.) to a single one ('a' or 'e' etc.).

Step2 - 

If 1st letter is capital 'F' then it is Female, or if it is 'M' then it is Male otherwise:

if length(gender) > 4 then it is Female else it is unknown, because 'ale' can be both.

 

If you check all unknown gender you may find some more rules to decide is it a Male or Female. (for example 'eminin' as part of 'Feminine' means 'Female' or 'Nale' is 'Male' as 'N' and 'M' are near letters in QWERT keboard);

 

gamotte
Rhodochrosite | Level 12

Hello,

 

You can use the complev function that computes a distance between strings (Levenshtein distance).

https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.5&docsetId=lefunctionsref&docsetTarg...

Using @PeterClemmensen dataset :

 

data have;
input gender $;
datalines;
Female  
Femalee 
Fame    
emale   
Male    
Malee   
ale     
Maale   
;

data want;
   set have;
   gender2=ifc(complev(gender,"MALE","i")<complev(gender,"FEMALE","i"), "Male", "Female");
run;

Note that "Fame" becomes "Male" so you can add other criteria that fit better the actual contents of your dataset.

ballardw
Super User

One thing, what should the result of the "cleaning" be?

Second, did anyone provide a document of what the acceptable answers should be?

Third by "import" do you mean you used Proc Import to read the data?

 

 

With data that has expected values but questionable data entry  I quite often write custom informat to read the expected values and adjust it to correct bad values as they occur. I use the _error_ option to write notes to the log of unexpected values so I can add them to the informat.

Then use a data step to read the data because Proc Import cannot do any such.

 

A brief example of the process.

proc format ;
invalue $gender (upcase)
'FEMALE' = 'F'
'MALE'   = 'M'
other    = _error_
;

data example;
   input x :$gender.;
datalines;
Male
mAle
male
female
Female
FemALE
feemale
mal
maale
;

This will create values of F and M (you did not indicate what you wanted for the values) or missing when something not in the informat is encountered.

The log for the data step looks like:

28   data example;
29      input x :gender.;
30   datalines;

NOTE: Invalid data for x in line 37 1-7.
RULE:      ----+----1----+----2----+----3----+----4----+----5---
37         feemale
x=. _ERROR_=1 _N_=7
NOTE: Invalid data for x in line 38 1-3.
38         mal
x=. _ERROR_=1 _N_=8
NOTE: Invalid data for x in line 39 1-5.
39         maale
x=. _ERROR_=1 _N_=9
NOTE: The data set WORK.EXAMPLE has 9 observations and 1
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time

The UPCASE option means any value encountered is converted to all uppercase before comparison with values in the Informat. So I can use the LOG information to add the upper case versions of the bad values to the Informat.

 

proc format ;
invalue $gender (upcase)
'FEMALE' ,'FEEMALE' = 'F'
'MALE','MAL','MAALE'   = 'M'
other    = _error_
;

data example;
   input x :$gender.;
datalines;
Male
mAle
male
female
Female
FemALE
feemale
mal
maale
;

I use this also for things like location codes as my data sources will often forget to tell me when a new site is activated. So I have a "valid site code" format that throws an error like above when the new code appears in the data so I can ask about the related details such as "where is the site", "what is its name" and relevant items used in reporting.

 

The above approach will recode any acceptable text though embedded quotes get a very little bit tricky.

If some of the values are expected to be blank you can use a  ' ' = some code to let you know they were blank or missing.

 

 

 

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1678 views
  • 0 likes
  • 5 in conversation