Steelers_In_DC
Barite | Level 11

I have a file with 675 records. For this file I have 64 rules or conditions that need to be applied, such as: ID is not null, Account_Number has a length of 16, Balance is no wider than a 16.6 numeric format.

 

I will get a fail rate from month to month for each of the 64 rules. My first thought is to write 64 if/then statements that generate 64 new binary columns, or 64 formats that do the same, but neither sounds appealing.  If I get through this process and am able to automate it, I will then move on to much larger files with much larger sets of conditions to check.  The rule and its fail rate are all I am concerned about to start.  Toward the end of the project I will need to drill down to the account level and find out why something failed.

 

This is currently done in an application called Trillium, which works fine but is very manual; my task is to automate the process.

 

Any and all thoughts and suggestions are welcome.  Thank You,

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

I would say some of this might depend on what kind of diagnostics you need. I am assuming this is part of an ETL process (read the data into SAS, then validate).

 

If you need a generic error flag, or to set out-of-range values to missing, then Proc Format may be appropriate at the input step with custom formats. If you need a more complete diagnostic, then you want data step logic that writes appropriate information to the log or a specific file, or sets specific flag variables. If a single variable gets multiple tests, then instead of multiple yes/no flags it might be worth contemplating a single variable to which you add a distinct value for each test that fails.

An example of this: suppose Var1 has minimum and maximum absolute values AND should also be less than (or greater than) another variable's value:

Var1Flag = 0;
if Var1 < 100  then Var1Flag = Var1Flag + 1;
if Var1 > 1000 then Var1Flag = Var1Flag + 2;
if Var1 > VarX then Var1Flag = Var1Flag + 4;
if Var1 < VarQ then Var1Flag = Var1Flag + 8;

 

Because the additions use powers of 2, the resulting value can be decomposed to recover every error that occurred (for example, a flag of 5 can only mean 1 + 4). If you have enough variables coded the same way, it might even be worth creating a custom format with the values and errors:

1 = Below range
2 = Above range
3 = Below range and above range (not realistic, but for example)
4 = Above reference variable
5 = Below range and above reference variable (reference might be missing)
6 = Above range and above reference variable
7 = Below range, above range, and above reference variable
8 = Below other reference variable
9 = Below range and below other reference variable
10 = Above range and below other reference variable
11 = Below range, above range, and below other reference variable (only get here with 1 + 2 + 8, so those messages)
12 = Above reference variable and below other reference variable (4 and 8)
13 = Below range, above reference variable, and below other reference variable (1, 4, and 8)

and so forth. But this is only worth it if you have many variables with basically identical types of checks (note that the range limits themselves can differ per variable).
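A sketch of both ideas, reusing the Var1Flag example above (the format name var1err and the value labels are my own; extend the format with the rest of the combinations as needed):

```
/* label the combined flag values with a custom format */
proc format;
  value var1err
    0  = 'Pass'
    1  = 'Below range'
    2  = 'Above range'
    4  = 'Above reference variable'
    8  = 'Below other reference variable'
    5  = 'Below range and above reference variable'
    12 = 'Above reference variable and below other reference variable'
    other = 'Multiple failures (decompose into powers of 2)';
run;

/* or test one bit directly with the bitwise AND function BAND() */
data _null_;
  Var1Flag = 5;                      /* 1 + 4 */
  if band(Var1Flag, 4) then put 'Failed the reference-variable check';
run;
```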

 

 

I would also look for same rules applied to multiple variables. Then you could use an array to apply the same logic to each of them.

And Select/When blocks may reduce some if/then statements. I've had data-checking jobs that checked as many as 60 variables for identical ranges of coding, and arrays made that much simpler.

 

I actually do data quality reports on a number of incoming periodic data sets and use binary flags for each condition on many of them, so I can write reports to the powers that be. Sums and means of 0/1 indicator variables work well to show counts and percentages of the types of errors that occur, with Proc Tabulate or Proc Report summaries.
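As a sketch of that reporting idea (the dataset name and flag variables here are hypothetical): with 0/1 fail flags, the SUM of each flag is the fail count and the MEAN is the fail rate.

```
/* checked is assumed to hold one row per record,
   with fail_rule1-fail_rule64 as 0/1 flag variables */
proc means data=checked n sum mean maxdec=3;
  var fail_rule1-fail_rule64;
run;
```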


5 REPLIES
RW9
Diamond | Level 26

Hi,

 

Yes, this is something I have looked at before, and I have seen many setups for it.  First off, you could try one of the data management tools provided by SAS or other companies, but then you have expense and overhead.  You can also do it yourself.  The coding part is simple, and I will come to that.  It is the structure for holding your "conditions" which requires the main thinking, as it needs to be robust yet not overbearing.  Other factors need to be taken into account: are the checks hierarchical, does a record drop out at the first failed check or does all the data get checked against every check, etc.  This is not something I can really provide much on, as you know your data and logic best.

 

Let's take a simple example. I set up a dataset with my conditions:

data checks;
  length check result $100;
  check="ID=''"; result="ID is missing"; output;
  check="lengthn(ACCOUNT_NUMBER) ne 16"; result="ACCOUNT_NUMBER Was not 16 in length"; output;
run;

data test;
  length id account_number $100;
  id="abc"; account_number="1234567890123456"; output;  /* Pass */
  id=""; account_number="1234567890123456"; output;  /* Trigger 1 */
  id=""; account_number="1234"; output;  /* Trigger both */
  id="def"; account_number="345"; output;  /* Trigger 2 */
run;

/* Do checks */
data _null_;
  set checks end=last;
  if _n_=1 then call execute('data want;  set test; length result $100;');
  call execute('if '||strip(check)||' then do; result="'||strip(result)||'"; output; end;');
  if last then call execute('run;');
run;

That's a simple example, and you can see that the way checks is set up needs to be well thought out and robust; the actual code-generation part is simple.
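Once want holds one row per failed check, a Proc Freq on the message text gives a count per rule; divide by the record count (675 here) for the fail rate:

```
proc freq data=want order=freq;
  tables result / nocum;    /* one line per rule message with its count */
run;
```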

Astounding
PROC Star

My inclination would be to create a single variable  that tracks all 64 tests.  It would be $64, and consist of "P" for pass and "F" for fail in each position.  Remember, you can use SUBSTR on the left-hand side of an equal sign:

 

if /* failed on test 7 */ then substr(pass_fail, 7, 1) = 'F';

 

You do need to maintain separate documentation about the position of each of the 64 tests.
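A minimal sketch of this approach, reusing the hypothetical test dataset from earlier in this thread (the assignment of rules to positions 1 and 2 is my own):

```
data checked;
  set test;
  length pass_fail $64;
  pass_fail = repeat('P', 63);    /* 64 'P's: REPEAT appends n extra copies */
  if missing(id)                   then substr(pass_fail, 1, 1) = 'F';
  if lengthn(account_number) ne 16 then substr(pass_fail, 2, 1) = 'F';
  n_failed = countc(pass_fail, 'F');   /* quick per-record failure count */
run;
```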


Ksharp
Super User

You could check out integrity constraints and use the audit trail to report which observations did not satisfy these conditions. Like:

 

data have;
  length a 8 b $ 20 ...........;
run;

proc datasets library=work;
  modify have;
  ic create check_a=check(where=(a between 10 and 20) message='A must be 10-20');
  ic create ...........;
  audit have;
  initiate;
quit;

proc append base=have data=want;
run;

proc print data=have(type=audit);
run;
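The audit data set carries automatic variables such as _ATDATETIME_, _ATOPCODE_, and _ATMESSAGE_ (check the SAS documentation for the operation codes before relying on them); a sketch of printing just the interesting columns:

```
proc print data=have(type=audit);
  var _ATDATETIME_ _ATOPCODE_ _ATMESSAGE_;  /* when, what operation, and message */
run;
```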

Reeza
Super User
If your rules change often, think about how you'll track changes in rule implementation over time.
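One way to do that, building on the rules-as-a-dataset idea earlier in this thread, is to version each rule with effective dates instead of editing it in place (the variable names here are my own invention):

```
data rules;
  length rule_id 8 check result $100;
  format valid_from valid_to date9.;
  rule_id = 1; valid_from = '01jan2015'd; valid_to = '31dec9999'd;
  check = "ID=''"; result = "ID is missing"; output;
run;

/* pick up only the rules in force on the run date */
data active_rules;
  set rules;
  where valid_from <= "&sysdate9"d <= valid_to;
run;
```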

