faspr
Fluorite | Level 6

If a text file has lines that do not have as many attributes as they should, how do I exclude just those observations?

 

I'm okay with missing values, just not okay with lines that don't have the correct number of attributes. In some cases the shift moves the last attribute into the second-to-last column, which causes an invalid-data error; those are easier to catch. The ones I cannot catch are the lines that shift but do not trigger an invalid-data error.

 

I have attached a sample file as well as one of my read-in attempts. The exact lines I'd like to exclude are 2, 4, and 7.

 

Any help would be appreciated, thank you.

 

SAS Enterprise Guide 6.1


15 REPLIES
Reeza
Super User

Post the information directly into the forum please. 

 

Look at the TRUNCOVER, MISSOVER options on the INFILE statement.
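
A minimal sketch of where those options sit (placeholder lengths, assuming the six-column layout shown later in the thread): TRUNCOVER reads whatever is left on a short line into the current variable, while MISSOVER sets the remaining variables to missing.

data examp ;
  infile 'Example File.txt' dlm=',' dsd truncover firstobs=2;   /* or missover */
  length REC_NUM 8 FIRST_NM $20 LAST_NM $20 COMPANY_NM $20 AGE 8 GENDER $1 ;
  input REC_NUM -- GENDER ;
run;

Note that both options only control what happens at the end of a short line; they do not reject the line.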

faspr
Fluorite | Level 6

I'm not sure what your first statement means, but yes I have tried all of the infile options. 

 

None of them help to solve the issue.

faspr
Fluorite | Level 6

text file:

REC_NUM,FIRST_NM,LAST_NM,COMPANY_NM,AGE,GENDER
1,JOHN,SMITH,,36,M
2,JERRY,JONES,,M
3,JASON,JACKSON,COMPANY A,45,M
4,,,"COMPANY C,INC",
5,HOLLY,SAMPSON,,76,F
6,,,COMPANY B,,
7,ALICE,COOPER,,
8,GRACE,JOHNSON,,33,F

 

One of the code attempts:

DATA EXAMP ;
INFILE "Example File.txt"
DELIMITER = ','
DSD
TRUNCOVER
FIRSTOBS=2
;

INFORMAT RECORD_NUM BEST1.;
INFORMAT FIRST_NM $CHAR20.;
INFORMAT LAST_NM $CHAR20.;
INFORMAT COMPANY_NM $CHAR20.;
INFORMAT AGE BEST2.;
INFORMAT GENDER $CHAR1.;

FORMAT RECORD_NUM BEST1.;
FORMAT FIRST_NM $CHAR20.;
FORMAT LAST_NM $CHAR20.;
FORMAT COMPANY_NM $CHAR20.;
FORMAT AGE BEST2.;
FORMAT GENDER $CHAR1.;

INPUT
RECORD_NUM
FIRST_NM $
LAST_NM $
COMPANY_NM $
AGE
GENDER $
;
NREC =_N_;
ERR = _ERROR_;
RUN;

Reeza
Super User

I only get one error and it's because your data doesn't follow the format required, not because of an issue with length. 

 

Record Number = 2

First Name = Jerry

Last Name = Jones

Company Name = Missing

Age = M? -> but this is supposed to be a number

Gender = M -> but this isn't the right place. 

 

2,JERRY,JONES,,M

 

Changing it to the following (adding a comma) allows it to be read correctly.

2,JERRY,JONES,,,M

 

If your data is actually like this, you have a problem, but it has nothing to do with the length. If the structure is inconsistent, how can you determine which field is which?

faspr
Fluorite | Level 6

Yes, the file was sent to us with corrupt lines (0.01% of observations).

 

It's a 350-million-record file, so in the end I'd like to flag short records on import so I can remove them later, or exclude them completely during the import.

Tom
Super User

Sounds like you don't care if COMPANY_NM is missing. So perhaps you want something like this? (Note: I put the data inline to make it easier to test and share.)

data want ;
  infile cards dsd truncover firstobs=2;
  length nrec nmiss REC_NUM 8 FIRST_NM $20 LAST_NM $20 COMPANY_NM $20 AGE 8 GENDER $1 ;
  input REC_NUM -- GENDER ;
  nrec+1;
  nmiss=cmiss(of REC_NUM -- GENDER);
  if nmiss > 1 then do;
    put 'NOTE: Missing data. ' ;
    _error_=1;
  end;
cards;
REC_NUM,FIRST_NM,LAST_NM,COMPANY_NM,AGE,GENDER
1,JOHN,SMITH,,36,M
2,JERRY,JONES,,M
3,JASON,JACKSON,COMPANY A,45,M
4,,,"COMPANY C,INC",
5,HOLLY,SAMPSON,,76,F
6,,,COMPANY B,,
7,ALICE,COOPER,,
8,GRACE,JOHNSON,,33,F
;;;;
The log then shows:

NOTE: Invalid data for AGE in line 427 16-16.
NOTE: Missing data.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----
427        2,JERRY,JONES,,M
nrec=2 nmiss=3 REC_NUM=2 FIRST_NM=JERRY LAST_NM=JONES COMPANY_NM=  AGE=. GENDER=  _ERROR_=1 _N_=2
NOTE: Missing data.
429        4,,,"COMPANY C,INC",
nrec=4 nmiss=4 REC_NUM=4 FIRST_NM=  LAST_NM=  COMPANY_NM=COMPANY C,INC AGE=. GENDER=  _ERROR_=1
_N_=4
NOTE: Missing data.
431        6,,,COMPANY B,,
nrec=6 nmiss=4 REC_NUM=6 FIRST_NM=  LAST_NM=  COMPANY_NM=COMPANY B AGE=. GENDER=  _ERROR_=1 _N_=6
NOTE: Missing data.
432        7,ALICE,COOPER,,
nrec=7 nmiss=3 REC_NUM=7 FIRST_NM=ALICE LAST_NM=COOPER COMPANY_NM=  AGE=. GENDER=  _ERROR_=1 _N_=7
NOTE: The data set WORK.WANT has 8 observations and 8 variables.

 

ballardw
Super User

What is the generic rule for excluding the records? I am not sure why line 6 in your example data is not excluded:

REC_NUM,FIRST_NM,LAST_NM,COMPANY_NM,AGE,GENDER
1,JOHN,SMITH,,36,M
2,JERRY,JONES,,M
3,JASON,JACKSON,COMPANY A,45,M
4,,,"COMPANY C,INC",
5,HOLLY,SAMPSON,,76,F
6,,,COMPANY B,,
7,ALICE,COOPER,,
8,GRACE,JOHNSON,,33,F 

Or did "line" refer" to position instead of the value for Rec_Num? Also not clear.

 

A quick look tells me your file is malformed. Your rec_num = 2 apparently has the gender value in the age position, and 4, 6, and 7 all have the wrong number of fields.

 

If this was one of my projects I would go back to the data source and have them clean up the file to match the expected layout.

Otherwise there is going to be a lot of data parsing and I don't have enough time for this type of data when $$ is not involved.

faspr
Fluorite | Level 6

I only wish to exclude the short records (records that don't have all six columns), as the vendor sent a bad file. The short records make up only 0.01% of the observations.

 

The vendor is slow-moving, so instead of waiting a week or two for a corrected file, I need a workaround to eliminate those 0.01% of records. And since it's 250 million records, I can't identify the short lines/records by hand.

 

 

Record 6 is fine: all six columns are there, just with missing values, which are okay.

Tom
Super User (Accepted Solution)

Sounds like you want to count the number of fields on the line?

  nrec+1;
  nmiss=cmiss(of REC_NUM -- GENDER);
  nfields=countw(_infile_,',','qm');
Obs    nrec    nmiss    nfields

 1       1       1         6
 2       2       3         5
 3       3       0         6
 4       4       4         5
 5       5       1         6
 6       6       4         6
 7       7       3         5
 8       8       1         6
faspr
Fluorite | Level 6

Yes, but I'm not sure how to incorporate the code you've provided into my original query. 

Tom
Super User

Just add it to the data step.

Don't make the data step so complicated.  PROC IMPORT writes really ugly SAS code.

To read a delimited file you just need to define the variables and then read them. If you have any dates or times then you would need to define informats and formats for those variables, but most numeric and character variables do not need (or want) to have formats or informats attached to them.

 

data want ;
  infile 'myfile.csv' dsd truncover firstobs=2;
  length nrec nfields REC_NUM 8 FIRST_NM $20 LAST_NM $20 COMPANY_NM $20 AGE 8 GENDER $1 ;
  input REC_NUM -- GENDER ;
  nrec+1;
  nfields=countw(_infile_,',','qm');
run;

So you might then want to use SAS code to look at this new NFIELDS variable to check the quality.

proc univariate ;
  id nrec ;
  var nfields;
run;
proc freq ;
 tables nfields ;
run;
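
If the counts confirm that complete lines always have six fields, the short records could be dropped in the same data step. A minimal sketch (assuming six is the expected count, per the header row of the file):

data want ;
  infile 'myfile.csv' dsd truncover firstobs=2;
  length nrec nfields REC_NUM 8 FIRST_NM $20 LAST_NM $20 COMPANY_NM $20 AGE 8 GENDER $1 ;
  input REC_NUM -- GENDER ;
  nrec+1;
  nfields=countw(_infile_,',','qm');
  if nfields < 6 then delete;   /* exclude the short records */
run;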

 

faspr
Fluorite | Level 6

This seems to do the trick for the sample I provided, thank you Tom.

 

Unfortunately, it doesn't work 100% with my main file, but this is a great start. Thank you!

faspr
Fluorite | Level 6

Tom,

 

Do you know a workaround for when a value like this also appears?

 

9,HANNAH,QU,HANNAH'S BAKERY INC.,46,F

 

faspr
Fluorite | Level 6

Got it:

 

nfields=countw(compress(_infile_,"'"),',','mq');
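
With the Q modifier, COUNTW appears to treat the unmatched apostrophe in HANNAH'S as an opening quotation mark, so the commas after it are no longer counted as delimiters; stripping single quotes first restores the correct count. A quick sketch to see the difference, using the sample line from the earlier post:

data _null_;
  line = "9,HANNAH,QU,HANNAH'S BAKERY INC.,46,F";
  raw   = countw(line,',','qm');                 /* apostrophe starts a "quoted" span, undercounting fields */
  fixed = countw(compress(line,"'"),',','qm');   /* remove apostrophes first: all 6 fields are counted */
  put raw= fixed=;
run;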

