If a text file has lines that do not have as many attributes as it should, how do I exclude those observations only?
I'm okay with missing values, just not okay with lines that don't have the correct number of attributes. Some cases it shifts the last attribute into the second to last column which can cause an invalid data problem. Those are easier to catch; however, the ones I cannot catch are when it shifts but does not cause an invalid data error.
I have attached a sample example as well as one of my read in attempts. The exact lines I'd like to exclude are 2,4 & 7.
Any help would be appreciated, thank you.
SAS Enterprise Guide 6.1
Sounds like you want to count the number of fields on the line?
nrec+1;
nmiss=cmiss(of REC_NUM -- GENDER);
nfields=countw(_infile_,',','qm');
Obs nrec nmiss nfields 1 1 1 6 2 2 3 5 3 3 0 6 4 4 4 5 5 5 1 6 6 6 4 6 7 7 3 5 8 8 1 6
Post the information directly into the forum please.
Look at the TRUNCOVER, MISSOVER options on the INFILE statement.
I'm not sure what your first statement means, but yes I have tried all of the infile options.
None of them help to solve the issue.
text file:
REC_NUM,FIRST_NM,LAST_NM,COMPANY_NM,AGE,GENDER
1,JOHN,SMITH,,36,M
2,JERRY,JONES,,M
3,JASON,JACKSON,COMPANY A,45,M
4,,,"COMPANY C,INC",
5,HOLLY,SAMPSON,,76,F
6,,,COMPANY B,,
7,ALICE,COOPER,,
8,GRACE,JOHNSON,,33,F
One of the code attempts:
DATA EXAMP ;
INFILE "Example File.txt"
DELIMITER = ','
DSD
TRUNCOVER
FIRSTOBS=2
;
INFORMAT RECORD_NUM BEST1.;
INFORMAT FIRST_NM $CHAR20.;
INFORMAT LAST_NM $CHAR20.;
INFORMAT COMPANY_NM $CHAR20.;
INFORMAT AGE BEST2.;
INFORMAT GENDER $CHAR1.;
FORMAT RECORD_NUM BEST1.;
FORMAT FIRST_NM $CHAR20.;
FORMAT LAST_NM $CHAR20.;
FORMAT COMPANY_NM $CHAR20.;
FORMAT AGE BEST2.;
FORMAT GENDER $CHAR1.;
INPUT
RECORD_NUM
FIRST_NM $
LAST_NM $
COMPANY_NM $
AGE
GENDER $
;
NREC =_N_;
ERR = _ERROR_;
RUN;
I only get one error and it's because your data doesn't follow the format required, not because of an issue with length.
Record Number = 2
First Name = Jerry
Last Name = Jones
Company Name = Missing
Age = M? -> but this is supposed to be a number
Gender = M -> but this isn't the right place.
2,JERRY,JONES,,M
Changing it to the follow (adding a comma) allows it to be read correctly.
2,JERRY,JONES,,,M
If your data is actually like this you have a problem, but it has nothing to do with the length. If the structure is inconsistent how can you determine which field is what?
Yes, the file was sent to us with corrupt lines (.01% of observations).
It's a 350 million record file, so in the end I'd like to flag short records on import to remove them later or exclude them completely in the import.
Sounds like you don't care if COMAPNY_NM is missing. So perhaps you want something like this? (Note: I put the data inline to make it easier to test and share.)
data want ;
infile cards dsd truncover firstobs=2;
length nrec nmiss REC_NUM 8 FIRST_NM $20 LAST_NM $20 COMPANY_NM $20 AGE 8 GENDER $1 ;
input REC_NUM -- GENDER ;
nrec+1;
nmiss=cmiss(of REC_NUM -- GENDER);
if nmiss > 1 then do;
put 'NOTE: Missing data. ' ;
_error_=1;
end;
cards;
REC_NUM,FIRST_NM,LAST_NM,COMPANY_NM,AGE,GENDER
1,JOHN,SMITH,,36,M
2,JERRY,JONES,,M
3,JASON,JACKSON,COMPANY A,45,M
4,,,"COMPANY C,INC",
5,HOLLY,SAMPSON,,76,F
6,,,COMPANY B,,
7,ALICE,COOPER,,
8,GRACE,JOHNSON,,33,F
;;;;
NOTE: Invalid data for AGE in line 427 16-16. NOTE: Missing data. RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+---- 427 2,JERRY,JONES,,M nrec=2 nmiss=3 REC_NUM=2 FIRST_NM=JERRY LAST_NM=JONES COMPANY_NM= AGE=. GENDER= _ERROR_=1 _N_=2 NOTE: Missing data. 429 4,,,"COMPANY C,INC", nrec=4 nmiss=4 REC_NUM=4 FIRST_NM= LAST_NM= COMPANY_NM=COMPANY C,INC AGE=. GENDER= _ERROR_=1 _N_=4 NOTE: Missing data. 431 6,,,COMPANY B,, nrec=6 nmiss=4 REC_NUM=6 FIRST_NM= LAST_NM= COMPANY_NM=COMPANY B AGE=. GENDER= _ERROR_=1 _N_=6 NOTE: Missing data. 432 7,ALICE,COOPER,, nrec=7 nmiss=3 REC_NUM=7 FIRST_NM=ALICE LAST_NM=COOPER COMPANY_NM= AGE=. GENDER= _ERROR_=1 _N_=7 NOTE: The data set WORK.WANT has 8 observations and 8 variables.
What is the generic rule for excluding the records? I am not sure why line 6 in your example data is not excluded:
REC_NUM,FIRST_NM,LAST_NM,COMPANY_NM,AGE,GENDER 1,JOHN,SMITH,,36,M 2,JERRY,JONES,,M 3,JASON,JACKSON,COMPANY A,45,M 4,,,"COMPANY C,INC", 5,HOLLY,SAMPSON,,76,F 6,,,COMPANY B,, 7,ALICE,COOPER,, 8,GRACE,JOHNSON,,33,F
Or did "line" refer" to position instead of the value for Rec_Num? Also not clear.
A quick look tells me your file is malformed. Your rec_num =2 apparently has the gender value in the age position, and 4,6 and 7 all have the wrong number of fields.
If this was one of my projects I would go back to the data source and have them clean up the file to match the expected layout.
Otherwise there is going to be a lot of data parsing and I don't have enough time for this type of data when $$ is not involved.
I only wish to exclude the short records (records that don't have all six columns) as the vendor sent a bad file. But the short records only make up .01% of the observations.
The vendor is slow moving, so instead of waiting a week or two for a corrected file, I need to find a workaround to eliminate those .01% of records. And since it's 250 million records, I can't self identify the short lines/records.
Record 6 is fine as there are six columns there, just missing values which are okay.
Sounds like you want to count the number of fields on the line?
nrec+1;
nmiss=cmiss(of REC_NUM -- GENDER);
nfields=countw(_infile_,',','qm');
Obs nrec nmiss nfields 1 1 1 6 2 2 3 5 3 3 0 6 4 4 4 5 5 5 1 6 6 6 4 6 7 7 3 5 8 8 1 6
Yes, but I'm not sure how to incorporate the code you've provided into my original query.
Just add it to the data step.
Don't make the data step so complicated. PROC IMPORT writes really ugly SAS code.
To read a delimited file you just need to define the variables and then read them. If you have any dates or times then would need to define informats and formats for those variables, but most numeric and character variables do not need (or want) to have formats or informats attached to them.
data want ;
infile 'myfile.csv' dsd truncover firstobs=2;
length nrec nfields REC_NUM 8 FIRST_NM $20 LAST_NM $20 COMPANY_NM $20 AGE 8 GENDER $1 ;
input REC_NUM -- GENDER ;
nrec+1;
nfields=countw(_infile_,',','qm');
run;
So you might then want to use SAS code to look at this new NFIELDS variable to check the quality.
proc univariate ;
id nrec ;
var nfields;
run;
proc freq ;
tables nfields ;
run;
This seems to do the trick for the sample I provided, thank you Tom.
Unfortunately, it doesn't work 100% with my main file, but this is a great start. Thank you!
Tom,
Do you know a workaround for if this was also a value?
9,HANNAH,QU,HANNAH'S BAKERY INC.,46,F
Got it:
nfields=countw(compress(_infile_,"'"),',','mq');
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.