Data quality improvement is a sophisticated process. SAS provides an extra application (Dataflux) and specialised server instance (DQ Server) for this, with additional licenses.
Start with getting knowledge of your data. Run proc freq, and identify possible mistakes.
How you fix it depends wholely on the detected mistakes (uppercase/lowercase, added blanks, numeric/word notation, ...)
For names, an approach could be the following:
- set up a table with expected names
data exp_names;
infile cards dlm=',';
input expected_name :$25. actual_name :$25.;
cards;
ABC AGENCY,ABC AGENCY
ABC AGENCY,ABC AGE. Y
;
run;
You can now match input data against actual_name; if a match is found, replace with expected_name; if not, output to a new dataset indicating errors. You then inspect the error dataset and add new lines to exp_names as you see appropriate.
You may be able to create this dataset initially by using rules (first or last 5 matching, or similar) as you stated, but there will be occurences that can't be handled by a simple rule and need your intervention and the application of Brain 1.0.
All this is of course the result of poor process design. Any data used for categorization has to be checked on input (eg company names are selected from a drop-down list of companies instead of entered manually); allowing free form input of such values is sub-optimal, to say the least.
... View more