Automated Data Cleaning

New Contributor
Posts: 4

I'm working with a wide dataset of 500+ variables and need to make sure every value in each row falls within its variable's domain. Right now I'm using a brute-force approach: typing one IF-THEN statement per variable that prints the study ID, variable name, and value whenever the value falls outside the domain:

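data _null_;
  set tmp2.bq_module_3;
  file print;  /* send PUT output to the print destination instead of the log */

  /* One hand-written check per variable: print ID, name, and value when out of domain */
  if TQ301 not in (1:5, 88, 99) then put STUDYID 'TQ301 ' TQ301;
  if TQ302 not in (1:10) then put STUDYID 'TQ302 ' TQ302;
  if TQ303 not in (1:1000) then put STUDYID 'TQ303 ' TQ303;
run;
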
What I'd like is a program that only requires me to enter the domain for each variable, something like:

TQ301 DOMAIN = (1:5, 88, 99, .)
TQ302 DOMAIN = (1:10)
TQ303 DOMAIN = (1:1000)

For TQ301-TQ303 do;
  if [value] not in [domain] then print STUDYID 'variable name' [value];
end;

Output would look like this:

STUDYNO TQ301 6
STUDYNO TQ302 11
STUDYNO TQ303 1001


Accepted Solutions
Solution
‎05-31-2017 11:31 AM
Super User
Posts: 9,867

Re: Automated Data Cleaning


Make a hash table to hold these domains, and check it in a DATA step.
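
A minimal sketch of how that could look, assuming the domains are first written out as one row per allowed value. The data set name WORK.DOMAINS and the variables VARNAME, VALUE, and I are made-up names for illustration, and the sketch assumes the study data has no variables with those names and that every variable in the array is numeric:

```sas
/* Hypothetical lookup table: one row per (variable name, allowed value) pair */
data work.domains;
  length varname $32 value 8;
  varname = 'TQ301';
  do value = 1 to 5, 88, 99; output; end;
  value = .; output;               /* missing is allowed for TQ301 only */
  varname = 'TQ302';
  do value = 1 to 10; output; end;
  varname = 'TQ303';
  do value = 1 to 1000; output; end;
run;

data _null_;
  length varname $32 value 8;      /* host variables for the hash keys */
  if _n_ = 1 then do;
    /* Load every allowed (variable, value) pair into memory once */
    declare hash dom(dataset: 'work.domains');
    dom.defineKey('varname', 'value');
    dom.defineDone();
  end;

  set tmp2.bq_module_3;
  file print;

  array q{*} TQ301-TQ303;          /* extend this list to cover all checked variables */
  do i = 1 to dim(q);
    varname = upcase(vname(q{i})); /* name of the current array element */
    value = q{i};
    /* CHECK() returns 0 when the key pair exists, i.e. the value is in the domain */
    if dom.check() ne 0 then put STUDYID varname value;
  end;
run;
```

With this layout, adding a variable means adding rows to the lookup table and one name to the ARRAY statement; the checking loop itself stays the same. Listing every allowed value works for small integer domains like these, but gets bulky for wide ranges.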



All Replies
Super User
Posts: 11,121

Re: Automated Data Cleaning

Without seeing an example of your data, I really wonder about 3 "study identification" variables on a single record. Do you mean that one record may have data from 3 (or possibly more) studies?

I will submit that negative definitions are also a bit weak, as 1001 is only yielding a result of TQ303 because it is overwriting the values your code assigned in the first two conditions.

Can you provide some example data of those variables? I suspect there may be a way to do this with a format, but my initial approach would require only one of your variables TQ301, TQ302 and TQ303 to be defined (not missing) on each record.

Also, the way your PUT statements are structured, it looks like you have a variable named STUDYID. Or is that a typo, given that your example desired output shows "STUDYNO"?
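
For what it's worth, here is one way a format-based check might look. This is only a sketch under assumptions, not necessarily the approach hinted at above: the format names (variable name plus a trailing F) are a made-up convention, VARNAME, VALUE, and I are illustrative names assumed not to exist in the study data, and each checked variable is assumed to be numeric with its own format defined.

```sas
/* Hypothetical convention: one format per variable, named <variable>F */
proc format;
  value tq301f 1-5, 88, 99, . = 'OK'  other = 'BAD';
  value tq302f 1-10           = 'OK'  other = 'BAD';
  value tq303f 1-1000         = 'OK'  other = 'BAD';
run;

data _null_;
  set tmp2.bq_module_3;
  file print;
  length varname $32;

  array q{*} TQ301-TQ303;          /* extend this list to cover all checked variables */
  do i = 1 to dim(q);
    varname = upcase(vname(q{i}));
    value = q{i};
    /* PUTN applies a format whose name is built at run time */
    if putn(value, cats(varname, 'F.')) = 'BAD' then put STUDYID varname value;
  end;
run;
```

One caveat: a format range such as 1-5 is continuous, so a non-integer like 2.5 would pass this check, whereas the IN (1:5) test in the original code would flag it. Missing values fall under OTHER unless the format lists them explicitly, as TQ301F does here.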
