Autotelic
Obsidian | Level 7
In Notepad++, going to Plugins > Converter > ASCII -> HEX didn't make any changes at the beginning of the file. Any other way I can check this?
TomKari
Onyx | Level 15

I thought perhaps I could use SAS to pull them out, so I created one notepad file with default encoding, and one with UTF8. I then ran the following code:

 

data _null_;
infile "dir\TestRegular.txt";
input;
list;
run;

data _null_;
infile "dir\TestUTF.txt";
input;
list;
run;

I thought the byte-order mark would appear in the second log, but something different happened. I got this message in the log:

 

NOTE: A byte-order mark in the file "dir\TestUTF.txt" (for fileref "#LN00059") indicates that the
data is encoded in "utf-8". This encoding will be used to process the file.

 

This says to me that if you run the same program (put an "if _n_ = 5 then stop;" line after "list;" to prevent it reading the whole file), the log should tell you what the byte-order mark in the file is.

 

Tom

 

Autotelic
Obsidian | Level 7
NOTE: A byte-order mark in the file "file_path_name" (for fileref "#LN00007") indicates that the data is encoded in "utf-16le".  This 
      encoding will be used to process the file.

Does this help?

TomKari
Onyx | Level 15

I'm going to assume that this means we can take all the speculation about whether the file is encoded with something different off the table, and assume that the file is UTF-16.

 

It looks to me like there are some invalid characters (characters that are not defined in UTF-16) at various places in the file. These characters are somehow interfering with your CRLF end of line indications, and also causing the transcoding error messages.

 

I'm not a specialist on encoding, so I'm not sure where to go from here. Is it possible to investigate the process that creates the file, to somehow ensure that everything in it is valid UTF-16?

 

Tom

Autotelic
Obsidian | Level 7

 Is it possible to investigate the process that creates the file, to somehow ensure that everything in it is valid UTF-16?

I think the problem might be here. This is the only file where this happens, out of a few other similar files. It happens even after I extract it from our platform again.
In fact, these exact same lines appear in other files, and they run fine.

I think the problem is with the last 0 in the line before the missing one.

 

Having said this, I can import the file in Pandas with UTF-16 encoding. I don't understand why it would work in Pandas and not in SAS.

Furthermore, I can't find any weird characters in the file.

 

TomKari
Onyx | Level 15

I don't know Pandas, so these comments are pure speculation. I wonder if it's possible that Pandas processes invalid characters for the file encoding differently, changing or dropping them, while SAS throws an error.

 

The challenge in finding the weird characters is figuring out how to examine them! I've never had to do this, so I don't have any suggestions.

Tom
Super User Tom
Super User

I suspect that you have had some transcoding error in generating the file.

Here is a simple program that will get you the hex code of each distinct character in the file, count the number of times it appears, and find its first and last location. I had it count the BOM character as position 0.

Hopefully the problem character will be one that appears rarely.

data test ;
  infile utf16 recfm=n encoding='any'; * read raw bytes, no transcoding ;
  length position 8 utf16 $2 utf16hex $4 ;
  input utf16 $char2. @@ ;  * read two bytes at a time ;
  utf16hex=put(utf16,$hex4.);  * hex code of the 2-byte unit ;
  output;
  position+1;  * increment after OUTPUT so the BOM is position 0 ;
run;
proc summary data=test nway ;
  class utf16hex ;
  var position ;
  output out=summary (drop=_type_) min=first_position max=last_position ;
run;
proc sort data=summary ;
  by _freq_ ;
run;
proc print;
run;

If you figure out for example that character '9900'x is the one that is causing the issue then replace it with something else.

For example you could use a step like this to copy the file character by character and replace the offending character with a space.

data _null_ ;
  infile utf16 recfm=n ;
  input utf16 $char2. @@ ;
  if utf16 ='9900'x then utf16='2000'x ;

  file newutf16 recfm=f lrecl=2 ;
  put utf16 $char2. @;
run;
Autotelic
Obsidian | Level 7

Hi, Tom.

 

Here are the characters that come up only once.

[screenshot: list of the characters that appear only once]

 

I used the converter https://r12a.github.io/apps/conversion/ to try to see what they look like and I can't find them anywhere near where the error occurs. Can't find them if I use the search feature either.

Tom
Super User Tom
Super User

You probably need to isolate to 4 or 5 lines around where the issue is occurring in your original file.

Then test that small sample and see if the error still happens when reading it normally.  If so then you can examine that short file to figure out if there is some specific character or sequence of characters that is causing the issue.

You could try making the subset using your Notepad++ editor. Just make sure to save the subset using the same encoding, or else Notepad++ might fix the problem characters for you. (But then again, if it does, you might just use Notepad++ to resave the original file and fix the issue that way!)

If you cannot get Notepad++ to save the lines, or the subset does not recreate the issue, then perhaps you can use code like below to copy specific lines.

 

So this program will copy lines 1, 4, 5, and 6 only. Just change the condition in the IF statement to select lines near where your original DATA step is seeing the issue. Make sure to copy line one in this data step to retain the BOM at the beginning of the file; otherwise you will need to tell SAS what encoding to use when you try reading it. It assumes that the file is using UTF-16LE, so it looks for '0A00'x following '0D00'x to indicate that a new line is starting.

data _null_;
  retain line 1;
  infile utf16 recfm=n ;
  file copy recfm=f lrecl=2;
  input ch $char2. @@;
  if line = 1 or 4<= line <=6 then put ch $char2. @ ;
  if ch='0A00'x and lag(ch)='0D00'x then line+1;
run;

With this little file of only a few lines you can try running your import program and see if it also throws an error. If it does then dump the whole file using a data step like this to see all of the characters without SAS trying to transcode them.

data _null_;
  infile copy recfm=f lrecl=100 encoding='any';
  input;
  list;
run;
