As I was running the tests with the different number of observations, I noticed something in the log that had not caught my attention before:
NOTE: A byte-order mark in the file "/sasem/gbmkuser/search/troy_1Q_2018_Report.csv" (for fileref "#LN00062") indicates that the data is encoded in "utf-8". This encoding will be used to process the file.
When I searched Google on this, I found a SAS page (http://support.sas.com/kb/19/028.html) that says: Unicode SAS with UTF-8 encoding uses 1 to 4 bytes to handle Unicode data. It is possible for the 8-bit characters to be expanded by 2 to 3 bytes when transcoding occurs, which causes the truncation error. You might not see these error messages with English SAS because the limitation is unique to Unicode SAS. There is no workaround for this issue. To avoid getting this error, do not use an external file that contains 8-bit characters.
I don't have any control over what users put into our search box. I'm assuming that one user put in something that looks like UTF-8, and another user put in something that was an 8-bit character. Seeing the comment "There is no workaround for this issue" is daunting.
At this point, it looks like the file structure itself is okay, but there are characters within the file that are causing problems for SAS. My questions are:
Is there a way to force the system to use a different encoding, instead of it automatically switching to that because of the contents of that record?
If I am able to force that switch, would it still fail with an I/O error because of the record that has the UTF-8 looking character?
Any other ideas how I can get SAS to work with this data?
I will add that one of my co-workers suggested that I bring the file into Excel, and save it as an Excel file. I gave this a try and it actually worked - Excel was able to open the file with no problem!
Then I used PROC IMPORT like Tom suggested on the Excel file, with the dbms=xlsx, and that worked!!
So this solved today's problem, but I was lucky this time that the file only had 969k records. If it had been too much larger, I would not have been able to open the whole file in Excel.
Here's where I'm scratching my head, though... Excel popped this file open in less than a minute, and it dealt with the characters in some way that cleansed them enough to get them into SAS. But it seems like if Excel can do it, then SAS should definitely be able to do it too. SAS is the tool I go to when nothing else works, and I've never found a problem I couldn't solve directly in SAS. So this is a little surprising to me that this SAS issue is being reported by SAS support as having "no workaround".
... View more