DATA Step, Macro, Functions and more

INFILE problem not grabbing all records from csv, field has quotation marks

Reply
Occasional Contributor
Posts: 9

INFILE problem not grabbing all records from csv, field has quotation marks

Hello, 

 

I am having an issue with the INFILE section in the Data step using Base SAS 9.4 by way of Enterprise Guide. The file that I am importing in question has 1523 records (not including column names on first row), but the statement is only pulling in 144 from the csv. When I look at the dataset that the code has produced, there exists a field where the quotation marks that surround each field in a csv pop out for whatever reason. When I check the csv, there is an odd character (an arrow) that is in the field where the issue starts.

 

Now, I wouldn't mind tossing the record itself later in the program, but the file completely stops reading at that claim. I am confident that the informants and lengths and such are correct because I used the SAS EG Import option to do the heavy lifting on that end. When I read the logs, the following is found, suggesting that there was no error (I am using dummy names to obscure potentially sensitive information):

 

NOTE: The infile "location\file.csv" is:

Filename=location,

RECFM=V,LRECL=32767,File Size (bytes)=1734349,

Last Modified=16May2018:05:59:01,

Create Time=16May2018:05:41:48

NOTE: 144 records were read from the infile "location\file.csv".

The minimum record length was 876.

The maximum record length was 1491.

NOTE: The data set WORK.DATASET has 144 observations and 154 variables.

NOTE: DATA statement used (Total process time):

real time 0.42 seconds

cpu time 0.07 seconds

 

Here is a truncated version of the Data step code that I am using. X is the field preceding the problem field, and Y is the problem field. I will note that field x is not populated anywhere in the document, so there should be nothing in field X to cause field Y to break.

 

data work.dataset;

length

X $ 1

Y $ 1492

;

label

X = "X"

Y = "Y"

;

format

X $CHAR1.

Y $CHAR1492.

;

informat

X $CHAR1.

Y $CHAR1492.

;

INFILE "location\file.csv"

firstobs=2

MISSOVER

DSD

;

input

X : $CHAR1.

Y : $CHAR1492.

;

run;

 

So, to summarize my question: does anyone know why the csv may be breaking, and what can I do to get all of the records in, even if some of them are broken?

 

 

Super User
Posts: 13,293

Re: INFILE problem not grabbing all records from csv, field has quotation marks

You have likely already identified the problem:

When I check the csv, there is an odd character (an arrow) that is in the field where the issue starts.

You don't say how you examined the CSV but unusual characters either from foreign language encoding or entry may be getting treated as some special character by your operating system, such as a ctrl-z may be treated as an "end of file".

 

 

If there is only once of these you might try deleting the odd character using a plain text editor such as Notepad or similar.

Occasional Contributor
Posts: 9

Re: INFILE problem not grabbing all records from csv, field has quotation marks

I appreciate the reply.

 

I deleted the character in the file, and everything loaded correctly. However, this is a program that is going to be used on new data per a monthly basis, so is there anything that can be done on the programming side to avoid the issue? I would like to avoid manually deleting a few characters every time that new data is pulled.

Super User
Posts: 23,237

Re: INFILE problem not grabbing all records from csv, field has quotation marks

1. Figure out why character is in file - is it encoding, different language, etc?

2. Ask data provider to modify data - this does actually work if the file has inconsistencies, such as this.

3. Use SAS or PowerShell or any language to read in the file, find and replace the character. There's examples of this on here, usually from Tom Smiley Happy. Then you can use the code above to read it. 

 

 

Super User
Posts: 13,293

Re: INFILE problem not grabbing all records from csv, field has quotation marks


@BozJ3 wrote:

I appreciate the reply.

 

I deleted the character in the file, and everything loaded correctly. However, this is a program that is going to be used on new data per a monthly basis, so is there anything that can be done on the programming side to avoid the issue? I would like to avoid manually deleting a few characters every time that new data is pulled.


If the data field corresponds to something that users enter free-form text into, such as a comment box, case notes or similar this may be an issue that doesn't occur often. It may be that letting the source of the file know that "unexpected" characters are appearing in one or more fields may help them adjust the creation of the extract to exclude such or filter at an earlier stage of processing.

 

If the field is the result of some operation then it may help to file provider clean up a process.

 

And sometimes files get corrupted from any number of causes. You might verify file size, number of records and bytes both with the file provider to see if what you see is what was intended.

Ask a Question
Discussion stats
  • 4 replies
  • 92 views
  • 0 likes
  • 3 in conversation