Solved: Re: How to exclude short lines in text infile?

faspr · Posted 10-13-2017 03:48 PM

If a text file has lines that do not have as many attributes as it should, how do I exclude those observations only?

I'm okay with missing values, just not okay with lines that don't have the correct number of attributes. Some cases it shifts the last attribute into the second to last column which can cause an invalid data problem. Those are easier to catch; however, the ones I cannot catch are when it shifts but does not cause an invalid data error.

I have attached a sample example as well as one of my read in attempts. The exact lines I'd like to exclude are 2,4 & 7.

Any help would be appreciated, thank you.

SAS Enterprise Guide 6.1

Tom · Posted 10-13-2017 04:51 PM

Sounds like you want to count the number of fields on the line?

  nrec+1;
  nmiss=cmiss(of REC_NUM -- GENDER);
  nfields=countw(_infile_,',','qm');

Obs    nrec    nmiss    nfields

 1       1       1         6
 2       2       3         5
 3       3       0         6
 4       4       4         5
 5       5       1         6
 6       6       4         6
 7       7       3         5
 8       8       1         6

View solution in original post

Reeza · Posted 10-13-2017 03:51 PM

Post the information directly into the forum please.

Look at the TRUNCOVER, MISSOVER options on the INFILE statement.

faspr · Posted 10-13-2017 03:52 PM

I'm not sure what your first statement means, but yes I have tried all of the infile options.

None of them help to solve the issue.

faspr · Posted 10-13-2017 03:55 PM

text file:

REC_NUM,FIRST_NM,LAST_NM,COMPANY_NM,AGE,GENDER
1,JOHN,SMITH,,36,M
2,JERRY,JONES,,M
3,JASON,JACKSON,COMPANY A,45,M
4,,,"COMPANY C,INC",
5,HOLLY,SAMPSON,,76,F
6,,,COMPANY B,,
7,ALICE,COOPER,,
8,GRACE,JOHNSON,,33,F

One of the code attempts:

DATA EXAMP ;
INFILE "Example File.txt"
DELIMITER = ','
DSD
TRUNCOVER
FIRSTOBS=2
;

INFORMAT RECORD_NUM BEST1.;
INFORMAT FIRST_NM $CHAR20.;
INFORMAT LAST_NM $CHAR20.;
INFORMAT COMPANY_NM $CHAR20.;
INFORMAT AGE BEST2.;
INFORMAT GENDER $CHAR1.;

FORMAT RECORD_NUM BEST1.;
FORMAT FIRST_NM $CHAR20.;
FORMAT LAST_NM $CHAR20.;
FORMAT COMPANY_NM $CHAR20.;
FORMAT AGE BEST2.;
FORMAT GENDER $CHAR1.;

INPUT
RECORD_NUM
FIRST_NM $
LAST_NM $
COMPANY_NM $
AGE
GENDER $
;
NREC =_N_;
ERR = _ERROR_;
RUN;

Reeza · Posted 10-13-2017 04:13 PM

I only get one error and it's because your data doesn't follow the format required, not because of an issue with length.

Record Number = 2

First Name = Jerry

Last Name = Jones

Company Name = Missing

Age = M? -> but this is supposed to be a number

Gender = M -> but this isn't the right place.

2,JERRY,JONES,,M

Changing it to the follow (adding a comma) allows it to be read correctly.

2,JERRY,JONES,,,M

If your data is actually like this you have a problem, but it has nothing to do with the length. If the structure is inconsistent how can you determine which field is what?

faspr · Posted 10-13-2017 04:38 PM

Yes, the file was sent to us with corrupt lines (.01% of observations).

It's a 350 million record file, so in the end I'd like to flag short records on import to remove them later or exclude them completely in the import.

Tom · Posted 10-13-2017 04:12 PM

Sounds like you don't care if COMAPNY_NM is missing. So perhaps you want something like this? (Note: I put the data inline to make it easier to test and share.)

data want ;
  infile cards dsd truncover firstobs=2;
  length nrec nmiss REC_NUM 8 FIRST_NM $20 LAST_NM $20 COMPANY_NM $20 AGE 8 GENDER $1 ;
  input REC_NUM -- GENDER ;
  nrec+1;
  nmiss=cmiss(of REC_NUM -- GENDER);
  if nmiss > 1 then do;
    put 'NOTE: Missing data. ' ;
    _error_=1;
  end;
cards;
REC_NUM,FIRST_NM,LAST_NM,COMPANY_NM,AGE,GENDER
1,JOHN,SMITH,,36,M
2,JERRY,JONES,,M
3,JASON,JACKSON,COMPANY A,45,M
4,,,"COMPANY C,INC",
5,HOLLY,SAMPSON,,76,F
6,,,COMPANY B,,
7,ALICE,COOPER,,
8,GRACE,JOHNSON,,33,F
;;;;

NOTE: Invalid data for AGE in line 427 16-16.
NOTE: Missing data.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----
427        2,JERRY,JONES,,M
nrec=2 nmiss=3 REC_NUM=2 FIRST_NM=JERRY LAST_NM=JONES COMPANY_NM=  AGE=. GENDER=  _ERROR_=1 _N_=2
NOTE: Missing data.
429        4,,,"COMPANY C,INC",
nrec=4 nmiss=4 REC_NUM=4 FIRST_NM=  LAST_NM=  COMPANY_NM=COMPANY C,INC AGE=. GENDER=  _ERROR_=1
_N_=4
NOTE: Missing data.
431        6,,,COMPANY B,,
nrec=6 nmiss=4 REC_NUM=6 FIRST_NM=  LAST_NM=  COMPANY_NM=COMPANY B AGE=. GENDER=  _ERROR_=1 _N_=6
NOTE: Missing data.
432        7,ALICE,COOPER,,
nrec=7 nmiss=3 REC_NUM=7 FIRST_NM=ALICE LAST_NM=COOPER COMPANY_NM=  AGE=. GENDER=  _ERROR_=1 _N_=7
NOTE: The data set WORK.WANT has 8 observations and 8 variables.

ballardw · Posted 10-13-2017 04:20 PM

What is the generic rule for excluding the records? I am not sure why line 6 in your example data is not excluded:

REC_NUM,FIRST_NM,LAST_NM,COMPANY_NM,AGE,GENDER
1,JOHN,SMITH,,36,M
2,JERRY,JONES,,M
3,JASON,JACKSON,COMPANY A,45,M
4,,,"COMPANY C,INC",
5,HOLLY,SAMPSON,,76,F
6,,,COMPANY B,,
7,ALICE,COOPER,,
8,GRACE,JOHNSON,,33,F

Or did "line" refer" to position instead of the value for Rec_Num? Also not clear.

A quick look tells me your file is malformed. Your rec_num =2 apparently has the gender value in the age position, and 4,6 and 7 all have the wrong number of fields.

If this was one of my projects I would go back to the data source and have them clean up the file to match the expected layout.

Otherwise there is going to be a lot of data parsing and I don't have enough time for this type of data when $$ is not involved.

faspr · Posted 10-13-2017 04:44 PM

I only wish to exclude the short records (records that don't have all six columns) as the vendor sent a bad file. But the short records only make up .01% of the observations.

The vendor is slow moving, so instead of waiting a week or two for a corrected file, I need to find a workaround to eliminate those .01% of records. And since it's 250 million records, I can't self identify the short lines/records.

Record 6 is fine as there are six columns there, just missing values which are okay.

Tom · Posted 10-13-2017 04:51 PM

Sounds like you want to count the number of fields on the line?

  nrec+1;
  nmiss=cmiss(of REC_NUM -- GENDER);
  nfields=countw(_infile_,',','qm');

Obs    nrec    nmiss    nfields

 1       1       1         6
 2       2       3         5
 3       3       0         6
 4       4       4         5
 5       5       1         6
 6       6       4         6
 7       7       3         5
 8       8       1         6

faspr · Posted 10-13-2017 04:59 PM

Yes, but I'm not sure how to incorporate the code you've provided into my original query.

Tom · Posted 10-13-2017 05:03 PM

Just add it to the data step.

Don't make the data step so complicated. PROC IMPORT writes really ugly SAS code.

To read a delimited file you just need to define the variables and then read them. If you have any dates or times then would need to define informats and formats for those variables, but most numeric and character variables do not need (or want) to have formats or informats attached to them.

data want ;
  infile 'myfile.csv' dsd truncover firstobs=2;
  length nrec nfields REC_NUM 8 FIRST_NM $20 LAST_NM $20 COMPANY_NM $20 AGE 8 GENDER $1 ;
  input REC_NUM -- GENDER ;
  nrec+1;
  nfields=countw(_infile_,',','qm');
run;

So you might then want to use SAS code to look at this new NFIELDS variable to check the quality.

proc univariate ;
  id nrec ;
  var nfields;
run;
proc freq ;
 tables nfields ;
run;

faspr · Posted 10-13-2017 05:12 PM

This seems to do the trick for the sample I provided, thank you Tom.

Unfortunately, it doesn't work 100% with my main file, but this is a great start. Thank you!

faspr · Posted 10-13-2017 05:26 PM

Tom,

Do you know a workaround for if this was also a value?

9,HANNAH,QU,HANNAH'S BAKERY INC.,46,F

faspr · Posted 10-13-2017 05:35 PM

Got it:

nfields=countw(compress(_infile_,"'"),',','mq');

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away