BookmarkSubscribeRSS Feed
rajeshitboys
Calcite | Level 5

Hi,

I am working on Social Media Data. I have my input file as ';' delimiter file. In twitter and Facebook contains lots of symbols :smileyplain:.

I wrote a import file code like:

       data WORK.Test    ;

       %let _EFIERR_ = 0; /* set the ERROR detection macro variable */

      infile 'D:\Rajesh_sun\POCData\SrBachchan.csv' delimiter = ';' MISSOVER DSD lrecl=13106

firstobs=2 TERMSTR=CRLF;

          informat level $3. ;

          informat id $3. ;

          informat parent_id $3. ;

          informat object_id $20. ;

          informat object_type $6. ;

          informat query_status $15. ;

          informat query_time $28. ;

          informat query_type $23. ;

          informat created_at $32. ;

          informat user_screen_name $17. ;

          informat favorite_count $3. ;

          informat retweet_count $4. ;

          informat entities_hashtags___text $12. ;

          informat entities_user_mentions___name $43. ;

          informat entities_urls___display_url $2. ;

          informat in_reply_to_user_id $11. ;

         informat in_reply_to_screen_name $12. ;

         informat in_reply_to_status_id $2. ;

         informat text $142. ;

         format level $3. ;

         format id $3. ;

         format parent_id $3. ;

         format object_id $20. ;

         format object_type $6. ;

         format query_status $15. ;

         format query_time $28. ;

         format query_type $23. ;

         format created_at $32. ;

         format user_screen_name $17. ;

         format favorite_count $3. ;

         format retweet_count $4. ;

         format entities_hashtags___text $12. ;

         format entities_user_mentions___name $43. ;

         format entities_urls___display_url $2. ;

         format in_reply_to_user_id $11. ;

         format in_reply_to_screen_name $12. ;

         format in_reply_to_status_id $2. ;

         format text $142. ;

      input

                  level $

                  id $

                  parent_id $

                  object_id $

                  object_type $

                  query_status $

                  query_time $

                  query_type $

                  created_at $

                  user_screen_name $

                  favorite_count $

                  retweet_count $

                  entities_hashtags___text $

                  entities_user_mentions___name $

                  entities_urls___display_url $

                  in_reply_to_user_id $

                  in_reply_to_screen_name $

                  in_reply_to_status_id $

                  text $

      ;

      if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */

      run;


While I execute the code i am getting an error like

NOTE: A byte-order mark in the file "D:\Rajesh_sun\POCData\SrBachchan.csv" (for fileref

      "#LN00023") indicates that the data is encoded in "utf-8".  This encoding will be used to

      process the file.

NOTE: The infile 'D:\Rajesh_sun\POCData\SrBachchan.csv' is:

      Filename=D:\Rajesh_sun\POCData\SrBachchan.csv,

      RECFM=V,LRECL=52424,File Size (bytes)=3390532,

      Last Modified=17Oct2014:21:42:06,

      Create Time=17Oct2014:21:51:09

ERROR: Invalid string.

FATAL: Unrecoverable I/O error detected in the execution of the DATA step program.

       Aborted during the EXECUTION phase.

NOTE: 9 records were read from the infile 'D:\Rajesh_sun\POCData\SrBachchan.csv'.

      The minimum record length was 72.

      The maximum record length was 364.

NOTE: The SAS System stopped processing this step because of errors.

WARNING: The data set WORK.TEST may be incomplete.  When this step was stopped there were 9

         observations and 19 variables.

WARNING: Data set WORK.TEST was not replaced because this step was stopped.

NOTE: DATA statement used (Total process time):

      real time           0.03 seconds

      cpu time            0.03 seconds

It have various types of Symbols. I attached a sample pics and file as well. Please help me to overcome this issue.

issue.png

29 REPLIES 29
Tom
Super User Tom
Super User

What version of SAS?  It works fine for me with SAS 9.4.

filename xxx zip 'C:\Downloads\SrBachchan.csv.zip' ;

data _null_ ;

  infile xxx('SrBachchan.csv') lrecl=30000 dsd dlm= ';' truncover;

  length x1-x20 $500;

  input x1-x20 ;

run;


NOTE: A total of 11829 records were read from the infile library XXX.

      The minimum record length was 0.

      The maximum record length was 590.

NOTE: 11829 records were read from the infile XXX('SrBachchan.csv').

      The minimum record length was 0.

      The maximum record length was 590.

Patrick
Opal | Level 21

That's an interesting one. If using the zip engine then things work as in Tom's code. If first extracting the file using 7-zip or WinZip then I'm getting the same error like the OP.

filename xxx 'C:\temp' ;

data _null_ ;

  infile xxx('SrBachchan.csv') lrecl=30000 obs=10;

  input;

run;

ERROR: Invalid string.

FATAL: Unrecoverable I/O error detected in the execution of the DATA step program.  Aborted during the EXECUTION phase.

The issue seems to appear for the first time at the end of line 10 in the source file and if I remove below string from source then I can read past of line 10. Not sure why though.

Capture.PNG

Having tested a bit further:

It appears that when using the zip engine SAS doesn't detect the BOM and reads the file using the session encoding (WLATIN1 in my case). In doing so things work.

When extracting the file first and then reading the .csv SAS detects the BOM and UFT-8 gets used; and that's where the issues occur.

Using Tom's code with the zip engine but setting the encoding to UTF-8 the code also throws an error.

filename xxx zip 'C:\temp\SrBachchan.csv.zip' ;

data _null_ ;

  infile xxx('SrBachchan.csv') lrecl=30000 dsd dlm= ';' truncover encoding='utf-8';

I've done a test where I've opened the .csv with Notepad++ and then saved it as UTF-8 without BOM. After this I read the file using WLATIN1 and again didn't get an error.

So it has something to do with UTF-8. I still don't understand why this is happening though.

rajeshitboys
Calcite | Level 5

Hi Tom,

Even I am also working on 9.4,  for this code also i am getting an error message.

NOTE: No encoding was specified for the fileref "#LN00013" (for the file "d:\SrBachchan.csv").  A

      byte order mark in the file indicates that the data is encoded in "utf-8".  This encoding will

      be used to process the file.

NOTE: The infile 'd:\SrBachchan.csv' is:

      Filename=d:\SrBachchan.csv,

      RECFM=V,LRECL=180000,File Size (bytes)=3390532,

      Last Modified=17Oct2014:21:42:08,

      Create Time=19Oct2014:23:40:47

ERROR: Invalid string.

FATAL: Unrecoverable I/O error detected in the execution of the DATA step program.

       Aborted during the EXECUTION phase.

NOTE: 10 records were read from the infile 'd:\SrBachchan.csv'.

      The minimum record length was 72.

      The maximum record length was 364.

NOTE: The SAS System stopped processing this step because of errors.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

Tom
Super User Tom
Super User

It seems that when code I posted is run using WLATIN1 encoding the file is just read it byte by byte and the UTF8 encoding is ignored.

You can see by looking at the value of the first field for this header line that it has ignored the encoding information and treated it as part of the data.

x1="level"

EFBBBF226C6576656C22

You can read it without errors by using a SAS session that is using UTF8 encoding.

rajeshitboys
Calcite | Level 5

Hi Tom,

I try to encoding like through encoding=WLATIN1 option i am getting an error like,

ERROR: The file "d:\SrBachchan.csv" could not be opened.  A byte order mark indicates that the data

       is encoded in "utf-8".  This conflicts with the "wlatin1" encoding that was specified for the

       fileref "#LN00014".

Here is the code which i submitted,

data _null_ ;

  infile 'd:\SrBachchan.csv' lrecl=30000 dsd dlm= ';' truncover encoding=WLATIN1;

  length x1-x20 $5099;

  input x1-x20 ;

run;

Tom
Super User Tom
Super User

That is backwards. You need to start SAS using UTF-8 encoding instead of WLATIN1 encoding. Then you can read that file without errors because SAS will not try to translate characters from UTF-8 to WLATIN1.  The problem is that the file contains characters that do not have translations in WLATIN1 encoding.

Patrick
Opal | Level 21

If it only was a translation issue then there shouldn't be a fatal error thrown.

rajeshitboys
Calcite | Level 5

Hi Tom,

Still I am getting the same error.

Statement:

     infile 'D:\Rajesh_sun\POCData\SrBachchan.csv' delimiter = ';' MISSOVER DSD lrecl=13106

firstobs=2 encoding="utf-8";

Error Log:

NOTE: The infile 'D:\Rajesh_sun\POCData\SrBachchan.csv' is:

      Filename=D:\Rajesh_sun\POCData\SrBachchan.csv,

      RECFM=V,LRECL=52424,File Size (bytes)=3390532,

      Last Modified=17Oct2014:21:42:06,

      Create Time=17Oct2014:21:51:09

ERROR: Invalid string.

FATAL: Unrecoverable I/O error detected in the execution of the DATA step program.

       Aborted during the EXECUTION phase.

NOTE: 9 records were read from the infile 'D:\Rajesh_sun\POCData\SrBachchan.csv'.

      The minimum record length was 72.

      The maximum record length was 364.

NOTE: The SAS System stopped processing this step because of errors.

Patrick
Opal | Level 21

What Tom suggests is that you change your session encoding to UTF-8. That's something you need to do in the config file and it will affect all SAS sessions starting up with this config file. It also means that you can't use some of the character functions anymore and that you will need to use the ones for multi byte character sets (mbcs). This is documented in the NLS bits of the SAS documentation. SAS(R) 9.4 National Language Support (NLS): Reference Guide, Third Edition

UTF-8 encodes characters with up to 4 bytes which allows for a vast amount of different characters like the ones used in the Chinese alphabet. Most people use single byte character sets (sbcs). WLATIN1 is a common sbcs code page for the English Language.

Naturally MBCS can't map 1:1 into SBCS as there are many more different characters possible in MBCS. UTF-8 is the "standard" Internet encoding. But even with UTF-8 the standard letters and numbers of the English alphabet use only 1 byte so here a 1:1 mapping into a single byte code page like WLATIN1 is possible.

When I installed SAS on my laptop I didn't tick the box for "multibyte" and so these modules haven't been installed on my machine. When I tried to set UTF-8 as encoding in my config file I've got a Kernel error when starting SAS. That is o.k. with me as I actually haven't installed these bits.

The way I would expect SAS to behave when reading a UTF-8 encoded source using a single byte SAS session encoding is to map single byte encoded characters in UTF-8 to single bytes in my session - and to throw a warning for all multi byte characters saying something like "WARNING: A character that could not be transcoded has been replaced in record xxx." And then map these bytes one:one resulting in some garbage characters.

What I consider to be a defect is what you're observing: SAS aborting with a fatal Error.

I suggest you open a ticket with SAS Tech Support. Contact SAS Institute Technical Support

If you decide to do so: Please let us know in this track what the outcome was.

Using Notepad++ (download it if you don't have it. It's a really helpful tool) I can see that your .csv is encoded as UTF-8 with a BOM. A BOM is a sequence of a few Bytes at the very beginning of a file. It's nothing you can see when opening the file (unless you look at it with a HEX editor - you can get this as a Notepad++ plugin) BUT: Software like SAS detects this BOM and therefore knows that the file is UTF-8 encoded. That's why SAS complains if you set 'ENCODING="WLATIN1" in your INFILE statement. SAS wants to use UTF-8. That's also why setting ENCODING='UTF-8' in the INFILE statement doesn't help you. that's what SAS already uses and where the issue occurs. The only thing which would help here according to Tom's testing is to set the SAS session encoding to UTF-8. I consider this to be a bit "extreme".

It appears that when using the ZIP engine - that's what Tom did in his first post - SAS ignores the BOM and just uses the session encoding. So if this is something like WLATIN1 for you then things seem to work (you still will get "garbage" for multi byte encoded characters though). Another work around for a once-off tasks and if you're only interested in "standard English letters and number" is to open the source file with Notepad++, switch under menu "Encoding" to "UTF-8 without BOM", save the file and then use in the INFILE statement ENCODING="WLATIN1" (or whatever code page you like). Now - without BOM - SAS won't complain anymore.

What also worked for me: Use in SAS EG the IMPORT Wizard. It took forever and I first thought EG crashed as it was unresponsive for a while - but with a little bit of patience EG 6.1 went over the scanning step and I was able to import the data.

jakarman
Barite | Level 11

UTF-8 is used at a lot places. Internet office all are utf-8 based. There are some other Unicode encodings around as several utf-16 versions.

A standard installation of SAS supporting all languages will have dedicated startups for lantin1 en and one for utf8.  It are the old sas programmers that are missing this changing world of encodings.

That bom notification is an important one. It just is telling you you are having an encoding file and utf-8 is assumed to be the one, may be it is a utf-16 one.

You can open the file with notepad++ it will detect some things of those.  The complete free formatted text could indicate getting problems using default input statements That is the next attention point. 
The quoted strings possible having quotes or other chars. SAS(R) 9.4 Formats and Informats: Reference

At line 11 there is a strange one, not able to interpreted at an earliers place dots are shown. It is utf8 encoded, you can read it with excel (utf8)

 

Please tackle issue for issue understanding them while solving.

---->-- ja karman --<-----
rajeshitboys
Calcite | Level 5

Hi Jaap,

Actually, Its a social media data. So we can't predict where these type of symbols will appear. And we cant able to resolve in manual process. There is 100's of symbols available in Social media's like facebook, twitter, tumblr etc.,

So Its really impossible to run Excel macro to scan and replace in Excel itself.

Community_Help
SAS Employee

Thanks for this question and the great help that followed. I'm moving this post to the Macros community so others might see it, too. Thanks!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 29 replies
  • 12145 views
  • 3 likes
  • 6 in conversation