I'm importing a large file of "search term" strings that users entered into our website, and I got a fatal error:
ERROR: Invalid string.
FATAL: Unrecoverable I/O error detected in the execution of the DATA step program. Aborted during the EXECUTION phase.
NOTE: 190 records were read from the infile "/sasem/gbmkuser/search/2Q_2018_Report.csv".
After the first run, I deleted a LOT of old data, and ran it again (in case the I/O error was being caused by a shortage of space), but I got the same error on the second run (and had 10 times more free space than the size of the file for that run).
I'm an employee using SAS EG, so I'm not really sure how much control I have on changing import settings.
My DATA step is below. I'm hoping that someone can help me change it to solve this issue.
Thanks so much! Barb
data search.file_2Q_2018_Report;
infile "&filepath/2Q_2018_Report.csv"
delimiter = ',' MISSOVER DSD lrecl=2000 firstobs=2 ;
format search_term_orig $200.
count_success comma10.
count_fail comma10.
count_total comma10.
;
input search_term_orig
count_success
count_fail
;
count_total = sum(count_success, count_fail);
run;
Hi Barb
The first thing I would suggest is to run it against a really small input file or a smaller sample of the file you are reading. I see you are already using the FIRSTOBS option, you could add the OBS option and set it to 100. Then run your code and see if that works (I suspect it will).
Next, I'd set OBS to half the number of rows in the input data and see if that runs (e.g. 1M rows in source file, set Obs=500000)
If that runs, then change FIRSTOBS to 500000 and remove or set OBS=MAX
I suspect there is something "odd" about the source data that may be causing the issue.
Andy
As I was running the tests with the different number of observations, I noticed something in the log that had not caught my attention before:
NOTE: A byte-order mark in the file "/sasem/gbmkuser/search/troy_1Q_2018_Report.csv" (for fileref "#LN00062") indicates that the data is encoded in "utf-8". This encoding will be used to process the file.
When I searched Google on this, I found a SAS page (http://support.sas.com/kb/19/028.html) that says: Unicode SAS with UTF-8 encoding uses 1 to 4 bytes to handle Unicode data. It is possible for the 8-bit characters to be expanded by 2 to 3 bytes when transcoding occurs, which causes the truncation error. You might not see these error messages with English SAS because the limitation is unique to Unicode SAS. There is no workaround for this issue. To avoid getting this error, do not use an external file that contains 8-bit characters.
I don't have any control over what users put into our search box. I'm assuming that one user put in something that looks like UTF-8, and another user put in something that was an 8-bit character. Seeing the comment "There is no workaround for this issue" is daunting.
At this point, it looks like the file structure itself is okay, but there are characters within the file that are causing problems for SAS. My questions are:
I will add that one of my co-workers suggested that I bring the file into Excel, and save it as an Excel file. I gave this a try and it actually worked - Excel was able to open the file with no problem!
Then I used PROC IMPORT like Tom suggested on the Excel file, with the dbms=xlsx, and that worked!!
So this solved today's problem, but I was lucky this time that the file only had 969k records. If it had been too much larger, I would not have been able to open the whole file in Excel.
Here's where I'm scratching my head, though... Excel popped this file open in less than a minute, and it dealt with the characters in some way that cleansed them enough to get them into SAS. But it seems like if Excel can do it, then SAS should definitely be able to do it too. SAS is the tool I go to when nothing else works, and I've never found a problem I couldn't solve directly in SAS. So this is a little surprising to me that this SAS issue is being reported by SAS support as having "no workaround".
Here's a couple of other SAS Notes that may help
I think you are on the right path with the comment
"I don't have any control over what users put into our search box. I'm assuming that one user put in something that looks like UTF-8, and another user put in something that was an 8-bit character. Seeing the comment "There is no workaround for this issue" is daunting."
If those notes don't help, then I suggest you open a TS Track
Hi @BRKS,
From your description of the problem I conclude:
I was able to replicate the issue with a small file and then read it using a rudimentary workaround.
/* Create a test file (with SAS 9.4M5 on a Windows machine using WLATIN1 encoding) */
data _null_;
file 'C:\Temp\bomtst.csv' dsd;
set sashelp.prdsale(obs=3);
if _n_=1 then put 'EFBBBF'x @; /* BOM (byte-order mark) */
if _n_=2 then country='Ägypten'; /* "illegal" in UTF-8 */
if _n_=3 then country='C384'x||'gypten'; /* UTF-8 encoding */
put product country actual;
run;
/* Replicate the issue */
data test;
infile 'C:\Temp\bomtst.csv' dsd;
input (product country) (:$10.) actual :dollar.;
run;
Log:
NOTE: A byte-order mark in the file "C:\Temp\bomtst.csv" (for fileref "#LN00034") indicates that the data is encoded in "utf-8". This encoding will be used to process the file. NOTE: The infile 'C:\Temp\bomtst.csv' is: (...) ERROR: Invalid string. FATAL: Unrecoverable I/O error detected in the execution of the DATA step program. Aborted during the EXECUTION phase. NOTE: 1 record was read from the infile 'C:\Temp\bomtst.csv'. The minimum record length was 19. The maximum record length was 19. NOTE: The SAS System stopped processing this step because of errors. WARNING: The data set WORK.TEST may be incomplete. When this step was stopped there were 1 observations and 3 variables.
/* Rudimentary workaround */
/* Step 1: Verify that BOM is present */
data _null_;
infile 'C:\Temp\bomtst.csv' recfm=N;
input c $char10.;
put c $hex20.;
stop;
run;
Log:
EFBBBF534F46412C4341
/* Step 2: Create a copy of the CSV file with BOM removed */
data _null_;
infile 'C:\Temp\bomtst.csv' recfm=N;
file 'C:\Temp\nobomtst.csv' recfm=N;
if _n_=1 then input +3 @@; /* skip first three bytes (adapt if BOM is longer) */
input c $char1.;
put c $1.;
run;
/* Step 3: Read file without BOM and create a copy of the affected variable (here: COUNTRY),
transcoded from UTF-8 to WLATIN1 (assuming this is the session encoding) */
data check(rename=(country=country_wlatin1));
length country_utf8 $10;
infile 'C:\Temp\nobomtst.csv' dsd;
input (product country) (:$10.) actual :dollar.;
country_utf8=kcvt(country, 'utf-8', 'wlatin1');
run;
proc print data=check(obs=3);
var product country_wlatin1 country_utf8 actual;
run;
PROC PRINT output (screenshot to avoid more transcoding issues)
Now you could select observations with country_wlatin1 ne country_utf8 from dataset CHECK and look at some or all of them to make an informed decision on the appropriate handling of this variable.
@BRKS wrote:
When I searched Google on this, I found a SAS page (http://support.sas.com/kb/19/028.html) that says: (...) There is no workaround for this issue.
(...)
Excel was able to open the file with no problem!
(...)
But it seems like if Excel can do it, then SAS should definitely be able to do it too. SAS is the tool I go to when nothing else works, and I've never found a problem I couldn't solve directly in SAS. So this is a little surprising to me that this SAS issue is being reported by SAS support as having "no workaround".
I think you misunderstood that Problem Note 19028: It's about a problem that was reported for SAS release 9.1 TS1M3 SP4 and fixed (at least on most platforms) in release 9.2 TS1M0 (see table "Operating System and Release Information"). Moreover, in the tab "Hot Fix" there are two links to hot fixes for the affected 9.1 release.
I fully agree that there must be a way to handle the situation with SAS and I would regard using Excel for this as "casting out devils by Beelzebub."
In the forum archives there are more discussions about this issue, e.g.:
I just had this same error message and was able to track down the problem - in my dataset, there was an EMOJI in an open-text field. When I opened the CSV file in Excel I could see the silly thing... so annoyed that RedCap let that in. I don't know how one could possibly search for or tell SAS to screen them when reading in, but in case anyone has the same issue.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.