Hi guys,
I was wondering if I can use SAS to detect any data entry errors, such as entering characters in numerical field, numbers in date fields, or entering something not in a certain format (e.g. xx/xxx/xx instead of just xx/xxx). Right now I just use proc print to see if anything looks out of place, but if the dataset becomes large then an automated way would be nicer.
Any advice or where I can find some clue would be greatly appreciated!
Thank you.
Yes, SAS is great for finding invalid data. Look into all the SAS functions.
For example, imagine you have a character variable that should only have letters in it, you can check that with:
data want;
input var $3.;
if notalpha(var) then put "Invalid value " var=;
cards;
AAA
B1B
CCC
;
run;
Ron Cody wrote an excellent book on using SAS for data cleaning. Thir edition just came out.
Yes, SAS is great for finding invalid data. Look into all the SAS functions.
For example, imagine you have a character variable that should only have letters in it, you can check that with:
data want;
input var $3.;
if notalpha(var) then put "Invalid value " var=;
cards;
AAA
B1B
CCC
;
run;
Ron Cody wrote an excellent book on using SAS for data cleaning. Thir edition just came out.
Thank you for the advice!
I found a few chapters of the second edition online, will try to read it and see what I can find!
Here's the method I like for detecting invalid data when a character field is supposed to contain a valid number:
if (varname > ' ') and input(varname, ?? 12.) = .;
Instead of doing corrections at later stage it is good to have Integrity constaints up front. So that data entry errors even do not happen. It is very difficult to analyse data entry errors and becomes really cumbersome after a point.
In my previous project we used to have regex table(a lookup table) with valid values for a particular column. If the values from your table matches to that of lookup table then it marked as good record with a flag value 'Y' else it was marked as a bad record with a Flag value 'N'. This helps your team to understand what are good records and what are bad reords.
This was handled within a macro, so similar tables could be tested in production.
For this things to have you need to define your business rules too(i.e. what does a bad record mean).
Link regarding integrity constraints
https://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000403555.htm
In addition to @kiranv_ I routinely create custom informats to trap many classes of invalid data: Range of numerics, specific values of numeric or character values and even turn text such as Yes, No Null into 1/0/. .
This works best when you read external data with a data step where you assign the informats. If you have an informat with an other = _error_ for the out of range responses the LOG will show the errors (unless you have so many that you exceed the number of displayed errors your system is set to display).
The _error_ will also result in a missing value in the data set.
This approach does indicate either a willingness to correct errors manually in the input data and re-read or leave missing data for unexpected/ invalid values.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.