One thing, what should the result of the "cleaning" be?
Second, did anyone provide a document of what the acceptable answers should be?
Third by "import" do you mean you used Proc Import to read the data?
With data that has expected values but questionable data entry I quite often write custom informat to read the expected values and adjust it to correct bad values as they occur. I use the _error_ option to write notes to the log of unexpected values so I can add them to the informat.
Then use a data step to read the data because Proc Import cannot do any such.
A brief example of the process.
proc format ;
invalue $gender (upcase)
'FEMALE' = 'F'
'MALE' = 'M'
other = _error_
;
data example;
input x :$gender.;
datalines;
Male
mAle
male
female
Female
FemALE
feemale
mal
maale
;
This will create values of F and M (you did not indicate what you wanted for the values) or missing when something not in the informat is encountered.
The log for the data step looks like:
28 data example;
29 input x :gender.;
30 datalines;
NOTE: Invalid data for x in line 37 1-7.
RULE: ----+----1----+----2----+----3----+----4----+----5---
37 feemale
x=. _ERROR_=1 _N_=7
NOTE: Invalid data for x in line 38 1-3.
38 mal
x=. _ERROR_=1 _N_=8
NOTE: Invalid data for x in line 39 1-5.
39 maale
x=. _ERROR_=1 _N_=9
NOTE: The data set WORK.EXAMPLE has 9 observations and 1
variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time
The UPCASE option means any value encountered is converted to all uppercase before comparison with values in the Informat. So I can use the LOG information to add the upper case versions of the bad values to the Informat.
proc format ;
invalue $gender (upcase)
'FEMALE' ,'FEEMALE' = 'F'
'MALE','MAL','MAALE' = 'M'
other = _error_
;
data example;
input x :$gender.;
datalines;
Male
mAle
male
female
Female
FemALE
feemale
mal
maale
;
I use this also for things like location codes as my data sources will often forget to tell me when a new site is activated. So I have a "valid site code" format that throws an error like above when the new code appears in the data so I can ask about the related details such as "where is the site", "what is its name" and relevant items used in reporting.
The above approach will recode any acceptable text though embedded quotes get a very little bit tricky.
If some of the values are expected to be blank you can use a ' ' = some code to let you know they were blank or missing.
... View more