Re: Import issue: Invalid string - Unrecoverable I/O error

BRKS · Posted 11-30-2018 11:42 AM

I'm importing a large file of "search term" strings that users entered into our website, and I got a fatal error:

   ERROR: Invalid string.
     FATAL: Unrecoverable I/O error detected in the execution of the DATA step program. Aborted during the EXECUTION phase.
     NOTE: 190 records were read from the infile "/sasem/gbmkuser/search/2Q_2018_Report.csv".

After the first run, I deleted a LOT of old data, and ran it again (in case the I/O error was being caused by a shortage of space), but I got the same error on the second run (and had 10 times more free space than the size of the file for that run).

I'm an employee using SAS EG, so I'm not really sure how much control I have on changing import settings.

My DATA step is below. I'm hoping that someone can help me change it to solve this issue.

Thanks so much! Barb

data search.file_2Q_2018_Report;
     infile "&filepath/2Q_2018_Report.csv"
          delimiter = ',' MISSOVER DSD lrecl=2000 firstobs=2 ;
     format search_term_orig $200.
             count_success comma10.
             count_fail comma10.
             count_total comma10.
     ;
     input   search_term_orig
             count_success
             count_fail
     ;
     count_total = sum(count_success, count_fail);
run;

AMSAS · Posted 11-30-2018 01:36 PM

Hi Barb

The first thing I would suggest is to run it against a really small input file or a smaller sample of the file you are reading. I see you are already using the FIRSTOBS option, you could add the OBS option and set it to 100. Then run your code and see if that works (I suspect it will).

Next, I'd set OBS to half the number of rows in the input data and see if that runs (e.g. 1M rows in source file, set Obs=500000)

If that runs, then change FIRSTOBS to 500000 and remove or set OBS=MAX

I suspect there is something "odd" about the source data that may be causing the issue.

Andy

BRKS · Posted 11-30-2018 02:30 PM

Hi Andy - thanks so much. I ran various obs amounts, and it looks like the issue is with record #191. If I run anything from obs=190 or lower, it works. If I run anything from obs=191 or higher, I get the failure. What could I do to prevent that record from failing? Thanks so much!

tomrvincent · Posted 11-30-2018 01:38 PM

I'd do this:

PROC IMPORT OUT=search.file_2Q_2018_Report
REPLACE
DATAFILE="&filepath/2Q_2018_Report.csv"
DBMS=csv;
GUESSINGROWS=100000
;
run;

BRKS · Posted 11-30-2018 02:33 PM

Hi Tom, Unfortunately I got the same error. 😞 This is a tough one!

BRKS · Posted 11-30-2018 02:55 PM

As I was running the tests with the different number of observations, I noticed something in the log that had not caught my attention before:

NOTE: A byte-order mark in the file "/sasem/gbmkuser/search/troy_1Q_2018_Report.csv" (for fileref "#LN00062") indicates that the data is encoded in "utf-8". This encoding will be used to process the file.

When I searched Google on this, I found a SAS page (http://support.sas.com/kb/19/028.html) that says: Unicode SAS with UTF-8 encoding uses 1 to 4 bytes to handle Unicode data. It is possible for the 8-bit characters to be expanded by 2 to 3 bytes when transcoding occurs, which causes the truncation error. You might not see these error messages with English SAS because the limitation is unique to Unicode SAS. There is no workaround for this issue. To avoid getting this error, do not use an external file that contains 8-bit characters.

I don't have any control over what users put into our search box. I'm assuming that one user put in something that looks like UTF-8, and another user put in something that was an 8-bit character. Seeing the comment "There is no workaround for this issue" is daunting.

At this point, it looks like the file structure itself is okay, but there are characters within the file that are causing problems for SAS. My questions are:

Is there a way to force the system to use a different encoding, instead of it automatically switching to that because of the contents of that record?
If I am able to force that switch, would it still fail with an I/O error because of the record that has the UTF-8 looking character?
Any other ideas how I can get SAS to work with this data?

I will add that one of my co-workers suggested that I bring the file into Excel, and save it as an Excel file. I gave this a try and it actually worked - Excel was able to open the file with no problem!

Then I used PROC IMPORT like Tom suggested on the Excel file, with the dbms=xlsx, and that worked!!

So this solved today's problem, but I was lucky this time that the file only had 969k records. If it had been too much larger, I would not have been able to open the whole file in Excel.

Here's where I'm scratching my head, though... Excel popped this file open in less than a minute, and it dealt with the characters in some way that cleansed them enough to get them into SAS. But it seems like if Excel can do it, then SAS should definitely be able to do it too. SAS is the tool I go to when nothing else works, and I've never found a problem I couldn't solve directly in SAS. So this is a little surprising to me that this SAS issue is being reported by SAS support as having "no workaround".

tomrvincent · Posted 11-30-2018 03:18 PM

You could write a macro loop that reads one line in at a time in various encodings and keep the successful rows.

BRKS · Posted 11-30-2018 03:38 PM

I'm not that smart, Tom! 😛 😄

AMSAS · Posted 11-30-2018 03:34 PM

Here's a couple of other SAS Notes that may help

Usage Note 38696

Usage Note 13727

I think you are on the right path with the comment
"I don't have any control over what users put into our search box. I'm assuming that one user put in something that looks like UTF-8, and another user put in something that was an 8-bit character. Seeing the comment "There is no workaround for this issue" is daunting."

If those notes don't help, then I suggest you open a TS Track

BRKS · Posted 11-30-2018 03:41 PM

Thanks AMSAS! when I click on those usage Note links, I get an error:

This site can’t be reached soul2.unx.sas.com’s server IP address could not be found.
Search Google for soul 2 unx sas 8081 TESSA main My Note
ERR_NAME_NOT_RESOLVED

Are they only accessible by SAS employees maybe?

Thanks!
Barb

AMSAS · Posted 11-30-2018 04:02 PM

I fixed them - Sorry about that

tomrvincent · Posted 11-30-2018 03:46 PM

http://support.sas.com/kb/38/696.html
http://support.sas.com/kb/13/727.html

AMSAS · Posted 11-30-2018 03:59 PM

Thanks Tom - That's what I get for trying to reply quickly, indeed I posted the internal SAS employee links

FreelanceReinh · Posted 12-02-2018 05:02 PM

Hi @BRKS,

From your description of the problem I conclude:

Your CSV file starts with a "byte-order mark" (BOM) -- probably 'EFBBBF'x -- indicating UTF-8 encoding. (I found this hex string in the document http://www.unicode.org/faq/utf_bom.html#29. linked in http://support.sas.com/kb/13/727.html.)
Line 191 of the CSV file contains a byte sequence that is "illegal" in UTF-8 encoding. According to the document mentioned above this might be a sequence of the form <110xxxxx₂ 0xxxxxxx₂>.

I was able to replicate the issue with a small file and then read it using a rudimentary workaround.

/* Create a test file (with SAS 9.4M5 on a Windows machine using WLATIN1 encoding) */

data _null_;
file 'C:\Temp\bomtst.csv' dsd;
set sashelp.prdsale(obs=3);
if _n_=1 then put 'EFBBBF'x @; /* BOM (byte-order mark) */
if _n_=2 then country='Ägypten'; /* "illegal" in UTF-8 */
if _n_=3 then country='C384'x||'gypten'; /* UTF-8 encoding */
put product country actual;
run;

/* Replicate the issue */

data test;
infile 'C:\Temp\bomtst.csv' dsd;
input (product country) (:$10.) actual :dollar.;
run;

Log:

NOTE: A byte-order mark in the file "C:\Temp\bomtst.csv" (for fileref "#LN00034") indicates that the data is encoded in "utf-8".  This encoding
      will be used to process the file.
NOTE: The infile 'C:\Temp\bomtst.csv' is:
(...)

ERROR: Invalid string.
FATAL: Unrecoverable I/O error detected in the execution of the DATA step program.  Aborted during the EXECUTION phase.
NOTE: 1 record was read from the infile 'C:\Temp\bomtst.csv'.
      The minimum record length was 19.
      The maximum record length was 19.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.TEST may be incomplete.  When this step was stopped there were 1 observations and 3 variables.

/* Rudimentary workaround */

/* Step 1: Verify that BOM is present */

data _null_;
infile 'C:\Temp\bomtst.csv' recfm=N;
input c $char10.;
put c $hex20.;
stop;
run;

Log:

EFBBBF534F46412C4341

/* Step 2: Create a copy of the CSV file with BOM removed */

data _null_;
infile 'C:\Temp\bomtst.csv' recfm=N;
file 'C:\Temp\nobomtst.csv' recfm=N;
if _n_=1 then input +3 @@; /* skip first three bytes (adapt if BOM is longer) */
input c $char1.;
put c $1.;
run;

/* Step 3: Read file without BOM and create a copy of the affected variable (here: COUNTRY),
           transcoded from UTF-8 to WLATIN1 (assuming this is the session encoding) */

data check(rename=(country=country_wlatin1));
length country_utf8 $10;
infile 'C:\Temp\nobomtst.csv' dsd;
input (product country) (:$10.) actual :dollar.;
country_utf8=kcvt(country, 'utf-8', 'wlatin1');
run;

proc print data=check(obs=3);
var product country_wlatin1 country_utf8 actual;
run;

PROC PRINT output (screenshot to avoid more transcoding issues)

Now you could select observations with country_wlatin1 ne country_utf8 from dataset CHECK and look at some or all of them to make an informed decision on the appropriate handling of this variable.

@BRKS wrote:

When I searched Google on this, I found a SAS page (http://support.sas.com/kb/19/028.html) that says: (...) There is no workaround for this issue.

(...)

Excel was able to open the file with no problem!

(...)

But it seems like if Excel can do it, then SAS should definitely be able to do it too. SAS is the tool I go to when nothing else works, and I've never found a problem I couldn't solve directly in SAS. So this is a little surprising to me that this SAS issue is being reported by SAS support as having "no workaround".

I think you misunderstood that Problem Note 19028: It's about a problem that was reported for SAS release 9.1 TS1M3 SP4 and fixed (at least on most platforms) in release 9.2 TS1M0 (see table "Operating System and Release Information"). Moreover, in the tab "Hot Fix" there are two links to hot fixes for the affected 9.1 release.

I fully agree that there must be a way to handle the situation with SAS and I would regard using Excel for this as "casting out devils by Beelzebub."

In the forum archives there are more discussions about this issue, e.g.:

Invalid string Error while importing a file which contains smiley symbols (mentions the zip engine as a possible remedy)
Changing session character encoding from wlatin1 to utf-8

eschofie · Posted 02-20-2023 09:15 PM

I just had this same error message and was able to track down the problem - in my dataset, there was an EMOJI in an open-text field. When I opened the CSV file in Excel I could see the silly thing... so annoyed that RedCap let that in. I don't know how one could possibly search for or tell SAS to screen them when reading in, but in case anyone has the same issue.

SAS Training: Just a Click Away