Marking columns that fail informats on input steps?

Accepted Solution Solved
Reply
Super Contributor
Posts: 418
Accepted Solution

Marking columns that fail informats on input steps?

Hello all. Had a quick question for anyone who might be able to answer it!

When using an input statement and a list of informats (similar to the example given below) is there a way to output a dataset that contains the columns that contained invalid data? I know that you can create a new variable using the _error_ variable in order to flag the rows in which there was bad data, however I do not know if such a thing exists for a given column. In the example below it would be the date column that could be formatted incorrectly (people fat finger things and type dates like 190/01/1993, etc..

I basically have to read in a file with 100+ columns (could be many more), and it is based upon user entered Excel data, in which many mistakes can and will occur. On top of this, I have to do it for numerous files (50-200+). Finally this is something that is run overnight or over the weekend depending on the requirments. Because of all of these factors, having to read through the Log file is extremely inefficient.

If anyone know of a method for flagging the columns with invalid informat data, please let me know!

Informat Example

data work.MyExample

infile "C:\my Location\myfile.csv" lrecl=3200 missover dsd firstobs=2;

informat variable1 $500.;

informat variable2 MMDDYY10.;

informat variable3 dollar32.;

format variable1 $500;

format variable2 MMDDYY10.;

format variable3 best32.;

input variable1 $

variable2

variable3;

run;


Accepted Solutions
Solution
‎12-31-2012 12:30 PM
PROC Star
Posts: 7,487

Re: Marking columns that fail informats on input steps?

Posted in reply to Anotherdream

You can always reroute and then analyze the log.  e.g.:

proc printto log='c:\testit.txt' new;

run;

data have;

  informat id $3.;

  informat date1 date2 date.;

  format date1 date2 date.;

  input id date1 date2;

  cards;

001 10jan2012 10jan2012

001 31dec2012 .

001 99mar2012 88apr2012

002 . 32apr2012

002 29dec2012 909mar2012

;

proc printto;

run;

data errors;

  length errors $255;

  infile "c:\testit.txt";

  call missing(errors);

  input;

  if find(_infile_,"NOTE: Invalid data for ") then do;

    errors=scan(substr(_infile_,24),1);

    do until (find(_infile_,"_N_"));

      input;

      if find(_infile_,"NOTE: Invalid data for ") then

        errors=catx(" ",errors,scan(substr(_infile_,24),1));

      else if find(_infile_,"_N_") then do;

        errors=catx(" ",substr(_infile_,find(_infile_,"_N_")),errors);

        output;

      end;

    end;

  end;

run;

which would produce the following file:

1_N_=3 date1 date2
2_N_=4 date2
3_N_=5 date2

View solution in original post


All Replies
Frequent Contributor
Posts: 86

Re: Marking columns that fail informats on input steps?

Posted in reply to Anotherdream

Hi

If we specify some format and sas found invalid data for that row, it will set it to missing. In that case, you can create a separate process for flagging the variables where there are stances of missing data.

There might be better ways to do. I can do it like

proc means data = temp  noprint;

var _all_;

output out = temp1 (Drop = _TYPE_ _FREQ_ ) nmiss =;

run;

:

Super Contributor
Posts: 418

Re: Marking columns that fail informats on input steps?

Would this not also flag varialbes that have naturally occuring missing data? If So (I believe it does), this is sadly a problem as each column that I wish to flag could also have missing data, and I will not know the number of instances of missing data per column.

If I am mistaken please let me know.

Solution
‎12-31-2012 12:30 PM
PROC Star
Posts: 7,487

Re: Marking columns that fail informats on input steps?

Posted in reply to Anotherdream

You can always reroute and then analyze the log.  e.g.:

proc printto log='c:\testit.txt' new;

run;

data have;

  informat id $3.;

  informat date1 date2 date.;

  format date1 date2 date.;

  input id date1 date2;

  cards;

001 10jan2012 10jan2012

001 31dec2012 .

001 99mar2012 88apr2012

002 . 32apr2012

002 29dec2012 909mar2012

;

proc printto;

run;

data errors;

  length errors $255;

  infile "c:\testit.txt";

  call missing(errors);

  input;

  if find(_infile_,"NOTE: Invalid data for ") then do;

    errors=scan(substr(_infile_,24),1);

    do until (find(_infile_,"_N_"));

      input;

      if find(_infile_,"NOTE: Invalid data for ") then

        errors=catx(" ",errors,scan(substr(_infile_,24),1));

      else if find(_infile_,"_N_") then do;

        errors=catx(" ",substr(_infile_,find(_infile_,"_N_")),errors);

        output;

      end;

    end;

  end;

run;

which would produce the following file:

1_N_=3 date1 date2
2_N_=4 date2
3_N_=5 date2
Super Contributor
Posts: 418

Re: Marking columns that fail informats on input steps?

Hello Arthur! This is a very helpful and useful start, thank you. This is pretty much exactly what I was looking for. The only issue I am having with this process is that the end dataset "errors" is printing out every single variable in a line that preceeds the main incorrect variable. Ex. It is printing

_n_=3 variable1 variable2 variable3, when I would only like to see variable 3.

This very well might be what you intended, but I need a bit more time to look through your code to figure that part out (not a sas expert by any means!) If this is indeed the case I will mark your answer as correct. By the way, very cool trick! was not aware of proc printto;

Thanks

Brandon

PROC Star
Posts: 7,487

Re: Marking columns that fail informats on input steps?

Posted in reply to Anotherdream

You can limit what you are searching for.  e.g.:

proc printto log='c:\art\testit.txt' new;

run;

data have;

  informat id $3.;

  informat date1 date2 date.;

  format date1 date2 date.;

  input id date1 date2;

  cards;

001 10jan2012 10jan2012

001 31dec2012 .

001 99mar2012 88apr2012

002 . 32apr2012

002 29dec2012 909mar2012

;

proc printto;

run;

data errors;

  length errors $255;

  infile "c:\art\testit.txt";

  call missing(errors);

  input;

  if find(_infile_,"NOTE: Invalid data for date2") then do;

    errors=scan(substr(_infile_,24),1);

    do until (find(_infile_,"_N_"));

      input;

      if find(_infile_,"NOTE: Invalid data for date2") then

        errors=catx(" ",errors,scan(substr(_infile_,24),1));

      else if find(_infile_,"_N_") then do;

        errors=catx(" ",substr(_infile_,find(_infile_,"_N_")),errors);

        output;

      end;

    end;

  end;

run;

proc print data=errors;

run;

Super Contributor
Posts: 418

Re: Marking columns that fail informats on input steps?

Hello again Arthur. I was able to read through your code and amke a quick minor change, and the following change is what I came up with.

Basically I just added a scan to the final else if, that way only the _n_=XX would be picked up. This would give me the exact row and colums that I needed.

I do have one final question if you don't mind answering it. I believe what you are doing in this process (at least from what I understood) was first looking at each row, and then if the row contained "NOTE: Invalid data for ", then you were scanning EVERY row between that row, and the next row that included "_n_". SO basically every row scans every row beneath it in the Do until loop, until it reaches the _n_. That is a very clever trick if I am understanding it correctly.

Thanks Brandon

 

data errors1;

length errors $1000;

infile "I:\Projects\Consulting - Analytics\validation_data\Commercial\testit.txt";

call missing(errors);

input;

if find(_infile_,"NOTE: Invalid data for ") then do;

errors=scan(substr(_infile_,24),1);

do until (find(_infile_,"_N_"));

input;

if find(_infile_,"NOTE: Invalid data for ") then

errors=catx(" ",errors,scan(substr(_infile_,24),1));

else if find(_infile_,"_N_") then do;

errors=catx(" ",scan(substr(_infile_,find(_infile_,"_N_")),1),errors);

output;

end;

end;

end;

Super User
Super User
Posts: 7,074

Re: Marking columns that fail informats on input steps?

Posted in reply to Anotherdream

One way is the read each cell independently and convert the value. Then you can test for the error when it occurs.  Here is example based on Art's data.  You could use a metadata table that lists the variables expected to generate the code rather than typing it out each block of code for all 50-100 variables per data type.

data want(drop=row col cell informat)

     errors (keep=row col cell informat)

  ;

  length row 8 col $32 cell $500 informat $14;

  infile cards dsd dlm=',' truncover firstobs=2;

row+1;

col='id';

input cell @ ;

informat= '$3.';

id = inputc(cell,informat);

if _error_ then output errors;

_error_=0;

col='date1' ;

input cell @ ;

informat= 'date9.';

date1 = inputn(cell,informat);

if _error_ then output errors ;

_error_=0;

col='date2' ;

input cell @ ;

informat= 'date9.';

date2 = inputn(cell,informat);

if _error_ then output errors ;

_error_=0;

output want ;

format date1 date2 date.;

cards;

id,date1,date2

001,10jan2012,10jan2012

001,31dec2012,.

001,99mar2012,88apr2012

002,.,32apr2012

002,29dec2012,909mar2012

run;

proc print data=errors width=min;

run;


Obs   row     col        cell       informat

1      3     date1    99mar2012      date9.

2      3     date2    88apr2012      date9.

3      4     date2    32apr2012      date9.

4      5     date2    909mar2012     date9.

PROC Star
Posts: 7,487

Re: Marking columns that fail informats on input steps?

Posted in reply to Anotherdream

First, yes, it sounds like you correctly understand what the code is doing.  However, since you only want to trap errors if they occur for one specific variable, the code can be simplified quite a bit.  Consider something like the following:

proc printto log='c:\testit.txt' new;

run;

data have;

  informat id $3.;

  informat date1 date2 date.;

  format date1 date2 date.;

  input id date1 date2;

  cards;

001 10jan2012 10jan2012

001 31dec2012 .

001 99mar2012 88apr2012

002 . 32apr2012

002 29dec2012 909mar2012

;

proc printto;

run;

data errors;

  infile "c:\testit.txt";

  input;

  if find(upcase(_infile_),"NOTE: INVALID DATA FOR DATE2") then do;

    input @ "_N_=" n;

    output;

  end;

run;

proc print data=errors;

run;

Super Contributor
Posts: 418

Re: Marking columns that fail informats on input steps?

Posted in reply to Anotherdream


Oh No I'm sorry, I did want to trap errors for all of the variables that are caught, and the code that I posted above does just this. It lists all X variables that had errors, along with just the values of the _n_ row.  The problem I was having with the original code was that it was grabbing the entire first row of the _n_=XXX section of the log, and I only wanted the _n_=XXX portion of it, but that was a very simple fix!

Your code worked like a dream! Thanks very much

Brandon

Valued Guide
Posts: 2,177

Re: Marking columns that fail informats on input steps?

Posted in reply to Anotherdream

why wait to read and decode the log?

could this error discovery not be made as each row is read?

In the normal pass, read all columns, but hold the line with a trailing '@'.

If _ERROR_ becomes non-zero, return to the beginning of the line and loop through each column+informat to discover each of the columns which have caused errors on that input line. For each cell which caused _error_ to be set, output an errors file with rownum, column num and input line position, input string and any message that you can trap

Of course this depends on returning to the start of the line, so no flowover or scanover process could be supported.

It might make an interesting metadata and macro application!

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 365 views
  • 0 likes
  • 5 in conversation