Output data errors with 193 criteria

Reply
Occasional Contributor
Posts: 16

Output data errors with 193 criteria

I am trying to detect data errors that occur outside the accptable range.  I have 193 variables. I want to keep the ID, Obs, Var(one specifiec var - not all 193) to 193 datasets. Then set them into one....  How whould I apply each cirterial to the apporopirate variable and get it all the erros back into one dataset?  Thank you. 

Data Errors;
set coverbtch;
where p0001 not in (.,1,2,3,4,5,6,7); keep id obx p0001;
where p0002 notin (1,2);
where p0003 >40;
where p0004 notin (.) and (p0004<2005 or p0004>2015);
where p0005>3000;
where p0006 notin &countylist;
where p0007 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13);
where p0008>999;
where p0009 notin (.) and (p0009<1 or p0009>50);
where p0010 notin (.,1,2,3);
where p0011 >100;
where p0012 notin (.,1,2,3,4);
where p0013 notin (.) and (p0013<6 or p0013>60);
where p0014 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13,14);
where p0015 notin (.,1,2,3,4,5,6,7,8,9);
where p0016 notin (.) and (p0016<12 or p0016>32);
where p0017 notin (.) and (p0017<8 or p0017>10);
where p0018 notin (.) and (p0018<8 or p0018>12);
where p0019 notin (.) and (p0019<8 or p0019>12);
where p0020 notin (.) and (p0020<10 or p0020>40);
where p0021 notin (.) and (p0021<10 or p0021>30);
where p0022 notin (.) and (p0022<8 or p0022>40);
where p0023 notin (.,1,2,3,4,5);
where p0024 notin (.) and (p0024<1 or p0024>50);
where p0025 notin (.) and (p0025<1 or p0025>50);
where p0026 notin (1,2,3);
where p0027 notin (.) and (p0027<1 or p0027>100);
where p0028 notin (.) and (p0028<1 or p0028>50);
where p0029 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13,14);
where p0030 notin (.,1,2,3,4,5,6);;
where p0031 notin (1,2);
where p0032 ne p0032<1 or p0032>5;
where p0033 notin (.) and (p0033<1 or p0033>50);
where p0034 notin (1,2);
where p0035 notin (1,2);
where p0036 notin (1,2,3);
where p0037 notin (.) and (p0037<1 or p0037>100);
where p0038 notin (.) and (p0038<1 or p0038>50);
where p0039 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13,14);
where p0040 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17);
where p0041 notin (.) and (p0041<8 or p0041>12);
where p0042 notin (.) and (p0042<12 or p0042>25);
where p0043 notin (.) and (p0043<2 or p0043>8);
where p0044 notin (.) and (p0044<10 or p0044>30);
where p0045 notin (.) and (p0045<14 or p0045>28);run;
where p0046 notin (.) and (p0046<8 or p0046>16);
where p0047 notin (.) and (p0047<10 or p0047>59);
where p0048 notin (.) and (p0048<10 or p0048>59);
where p0049 notin (.) and (p0049<20 or p0049>35);
where p0050 notin (.) and (p0050<22 or p0050>30);
where p0051 notin (.) and (p0051<8 or p0051>12);
where p0052 notin (1,2);
where p0053 notin (.) and (p0053<1 or p0053>5);
where p0054 >30;
where p0055 notin (1,2);
where p0056 notin (1,2);
where p0057 notin (1,2,3);
where p0058 >100;
where p0059 notin (.) and (p0059<5 or p0059>19);
where p0060 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
where p0061 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13);
where p0062 notin (1,2);
where p0063 ne p0063<1 or p0063>5;
where p0064 >40;
where p0065 notin (.,1,2);
where p0066 notin (.,1,2);
where p0067 notin (.,1,2,3);
where p0068 >200;
where p0069 notin (.,1,2);
where p0070 >70;
where p0071 >5;
where p0072 >1000;
where p0073 <20 or p0073>260;
where p0074 notin (.,1,2);
where p0075 <20 or p0075>260;
where p0076 notin (.,1,2);
where p0077 notin (.,1,2);
where p0078 notin (.,1,2);
where p0079 notin (.,1,2);
where p0080 notin (.,1,2);
where p0081 notin (.,1,2);
where p0082 notin (.,1,2);
where p0083 notin (.,1,2);
where p0084 notin (.,1,2);
where p0085 notin (.,1,2);
where p0086 notin (.,1,2);
where p0087 notin (.) and (p0087<1 or p0087>10);
where p0088 notin (.) and (p0088<1 or p0088>30);
where p0089 notin (.) and (p0089<1 or p0089>30);
where p0090 notin (.) and (p0090<1 or p0090>80); run;
where p0091 notin (.) and (p0091<1 or p0091>30);
where p0092 notin (.) and (p0092<1 or p0092>10);
where p0093 notin (.) and (p0093<1 or p0093>10);
where p0094 notin (.) and (p0094<1 or p0094>50);
where p0095 notin (.) and (p0095<1 or p0095>50);
where p0096 notin (.) and (p0096<1 or p0096>20);
where p0097 notin (.) and (p0097<1 or p0097>10);
where p0098 notin (.) and (p0098<1 or p0098>10);
where p0099 notin (.) and (p0099<1 or p0099>80);
where p0100 notin (.) and (p0100<1 or p0100>40);
where p0101 notin (.) and (p0101<1 or p0101>80);
where p0102 notin (.) and (p0102<1 or p0102>60);
where p0103 notin (.) and (p0103<1 or p0103>80);
where p0104 notin (.) and (p0104<1 or p0104>10);
where p0105 notin (.) and (p0105<1 or p0105>30);
where p0106 notin (.) and (p0106<1 or p0106>50);
where p0107 notin (.) and (p0107<1 or p0107>40);
where p0108 notin (.) and (p0108<1 or p0108>20);
where p0109 >100;
where p0110 notin (.,1,2);
where p0111 notin (.,1,2);
where p0112 notin (.,1,2);
where p0113 notin (.,1,2);
where p0114 notin (.,1,2);
where p0115 notin (.,1,2);
where p0116 ne p0116<1 or p0116>50;
where p0117 ne p0117<1 or p0117>5;
where p0118 notin (.,1,2,3,4,5);
where p0119 notin (.,1,2,3,4,5);
where p0120 notin (.,1,2,3);
where p0121 notin (.,1,2,3);
where p0122 notin (.,1,2,3);
where p0123 >100;
where p0124 >100;
where p0125 >100;
where p0126 notin (.,1,2,3);
where p0127 notin (.,1,2,3);
where p0128 notin (.,1,2,3);
where p0129 >100;
where p0130 >100;
where p0131 >100;
where p0132 ne p0132<1 or p0132>20;
where p0133 ne p0133<1 or p0133>20;
where p0134 ne p0134<1 or p0134>20;
where p0135 ne p0135<1 or p0135>20;
where p0136 ne p0136<1 or p0136>20;
where p0137 ne p0137<1 or p0137>20;
where p0138 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
where p0139 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
where p0140 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
where p0141 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
where p0142 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
where p0143 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
where p0144 ne p0144<8 or p0144>40;
where p0145 ne p0145<8 or p0145>40;
where p0146 ne p0146<8 or p0146>40;
where p0147 ne p0147<8 or p0147>40;
where p0148 ne p0148<8 or p0148>40;
where p0149 ne p0149<8 or p0149>40;
where p0150 ne p0150<8 or p0150>40;
where p0151 ne p0151<8 or p0151>40;
where p0152 ne p0152<8 or p0152>40;
where p0153 ne p0153<8 or p0153>40;
where p0154 ne p0154<8 or p0154>40;
where p0155 >0;
where p0156 ne p0156<1 or p0156>17;
where p0157 ne p0157<1 or p0157>17;
where p0158 ne p0158<1 or p0158>17;
where p0159 ne p0159<1 or p0159>17;
where p0160 ne p0160<1 or p0160>17;;
where p0161 ne p0161<1 or p0161>17;
where p0162 ne p0162<1 or p0162>5;
where p0163 notin (.,1,2);
where p0164 ne p0164<1 or p0164>13;
where p0165 notin (.,1);
where p0166 notin (.,2);
where p0167 notin (.,3);
where p0168 notin (.,4);
where p0169 notin (.,5);
where p0170 notin (.,6);
where p0171 notin (.,7);
where p0172 notin (.,8);
where p0173 notin (.,9);
where p0174 notin (.,10);
where p0175 notin (.,11);
where p0176 notin (.,12);
where p0177 notin (.,13);
where p0178 notin (.,1,2,3,4,5);
where p0179 notin (.,1,2,3,4,5);
where p0180 notin (.,1,2,3,4,5);
where p0181 notin (.,1,2,3,4,5);
where p0182 notin (.,1,2,3,4,5);
where p0183 notin (.,1,2,3,4,5);
where p0184 notin (.,1,2,3,4,5);
where p0185 notin (.,1,2,3,4,5);
where p0186 notin (.,1,2,3);
where p0187 notin (.,1,2,3);
where p0188 notin (.,1,2,3);
where p0189 notin (.,1,2,3);
where p0190 notin (.,1,2,3);
where p0191 notin (.,1,2);
where p0192 notin (.,14);
where p0193 >1;
run;

Respected Advisor
Posts: 3,066

Re: Output data errors with 193 criteria

[ Edited ]

For starters the number of unique error checks is a lot less than 193. For example the check not in (.,1,2) is applied to over 20 variables but can be considered just one check. I suggest you dedup your checks down do a unique list, then figure out which variables they apply to.

 

For doing one check on multiple variables an array is the way to go. I would also create a binary flag for each check also using an array. Here is a simple example for the not in (.,1,2) check:

 

 

data want;
  set have;
  array pvars (*) p0001 - p0193;
  array checks (*) check1 - check193;
  do i = 65, 66, 69, 74, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86;
    checks(i) = (pvars(i) not in (.,1,2));
  end;
  if sum(of check1 - check193) > 0;
run;
    

 

 

Occasional Contributor
Posts: 16

Re: Output data errors with 193 criteria

Thank you SA Skiwi. I have revised the code to use if statements. Im not understanding the array... I can look into it... Where there are multiple errors per ID, can the array create multiple flags per ID? The way this is written, there is only one output for one error per ID.
Data Errors;
set coverbtch;
if p0001 notin (.,1,2,3,4,5,6,7) then do; Flag='p0001'; pval=p0001; end;
if p0002 notin (.,1,2) then do; Flag='p0002'; pval=p0002; end;
if p0003 notin (.) and p0003>40 then do; Flag='p0003'; pval=p0003; end;
if p0004 notin (.) and (p0004<2005 or p0004>2015) then do; Flag='p0004'; pval=p0004; end;
if p0005 notin (.) and p0005>3000 then do; Flag='p0005'; pval=p0005; end;
if p0006 notin &countylist then do; Flag='p0006'; pval=p0006; end;
if p0007 notin (.,1,2,3,4,5,6,7,8,9,10,11,12,13) then do; Flag='p0007'; pval=p0007; end;
if p0008 notin (.) and p0008>999 then do; Flag='p0008'; pval=p0008; end;
if p0009 notin (.) and (p0009<1 or p0009>50) then do; Flag='p0009'; pval=p0009; end;
Etc.
Etc.
thank you!
Respected Advisor
Posts: 4,998

Re: Output data errors with 193 criteria

[ Edited ]

The simplest answer is that you would never do this.  Consider what you would end up with.  Your final data set after putting the 193 data sets together again would contain all the variables.  But it would contain fewer observations ... those with something out of range.  And the variable values would change:  anything that was originally in range would now contain a missing value.  All of that can be accomplished in a single DATA step, without creating 193 separate data sets along the way.  For example:

 

if p0003 > 40 then p0003 = .;

 

Then at the end, something like:

]

if n(of p0001 - p0193) = 0;

 

When I say "something like" it's because I can't tell whether COUNTY is numeric or character.  And I can't tell whether there are other numeri variables (not being checked) that need to be accounted for when selecting.

 

First, you will need to fix many of your checks.  They are not all valid SAS code.  And you will need to consider whether a missing value is going to be considered something that needs to be checked.  Some of your checks flag missing values.  But the general objective should be to create one data set, without separating into 193 data sets.

SAS Super FREQ
Posts: 8,721

Re: Output data errors with 193 criteria

Hi:

  And, in addition to all the other good suggestions you've received, I'd like to point out that stringing your WHERE statements like that is not a good practice, as shown below. Did you notice the messages in the log about the WHERE clause being replaced???

 

cynthiawhere_clause_replaced.png

Occasional Contributor
Posts: 16

Re: Output data errors with 193 criteria

Thank you Cynthia.  

Occasional Contributor
Posts: 16

Re: Output data errors with 193 criteria

All variables are numeric.
Grand Advisor
Posts: 10,239

Re: Output data errors with 193 criteria

Is the purpose of this exercise to varify that values are within expected ranges for the variables?

If that is the case you may want to consider going back to where your read in the data and use custom informats that check the valid ranges. I use this for several projects to check the data before going on to the next steps. If your "error" rate is low this may be viable solution.

If you want a record of the log results use Proc Printto to create file of all of the error  messages.

 

proc format library=work;
invalue p0001in
.,1,2,3,4,5,6,7 = _same_
other = _error_;
run;

data example;
   informat id best5. p0001 p0001in.;
   input id p0001;
datalines;
1 .
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 0
10 1.5
11 444
;
run;

which generates a log like:

 

55   data example;
56      informat id best5. p0001 p0001in.;
57      input id p0001;
58   datalines;

NOTE: Invalid data for p0001 in line 67 3-3.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--
67         9 0
id=9 p0001=. _ERROR_=1 _N_=9
NOTE: Invalid data for p0001 in line 68 4-6.
68         10 1.5
id=10 p0001=. _ERROR_=1 _N_=10
NOTE: Invalid data for p0001 in line 69 4-6.
69         11 444
id=11 p0001=. _ERROR_=1 _N_=11
NOTE: The data set USER.EXAMPLE has 11 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds

The informat approach has a couple of advantages:

 

1) variable with identical data limits can share the same informat

2) the resulting data set will not have any erroneous values (except missing if that is an error)

3) records with multiple errors will all be together in the log. If I had 100+ variables and variable p0001 and p0100 had errors

the log would show something like:

NOTE: Invalid data for p0001 in line 68 4-6.
      Invalid data for p0100 in line 68 245-247

so I wouldn't be looking all over for where multiple errors occur on the same line.

 

4) sometimes this yields enough information that you find out the file format is not as expected

 

Ranges in Proc format can be done with

20 - 40 =

or

0 < 1, 50>high

Occasional Contributor
Posts: 16

Re: Output data errors with 193 criteria

The Errors are printed by Batch#, pval, and ID/Obs then given to someone to open the key entry-text datafile, find the error, correct it and save.
Grand Advisor
Posts: 10,239

Re: Output data errors with 193 criteria

The above does show the P variable and line number in the file. Id will be one  of the many variables with a value in the diagnostics. This can produce a lot of text if you have errors on many rows.

 

You did not mention "batch#" or show how that would be know in your first post. So I'm not sure how to include that information.

 

Ask a Question
Discussion stats
  • 9 replies
  • 158 views
  • 0 likes
  • 5 in conversation