Autotelic
Obsidian | Level 7
In Notepad++, going to Plugins > Converter > ASCII -> HEX didn't make any changes at the beginning of the file. Any other way I can check this?
TomKari
Onyx | Level 15

I thought perhaps I could use SAS to pull them out, so I created one notepad file with default encoding, and one with UTF8. I then ran the following code:

 

data _null_;
infile "dir\TestRegular.txt";
input;
list;
run;

data _null_;
infile "dir\TestUTF.txt";
input;
list;
run;

I thought the byte-order mark would appear in the second log, but something different happened. I got this message in the log:

 

NOTE: A byte-order mark in the file "dir\TestUTF.txt" (for fileref "#LN00059") indicates that the
data is encoded in "utf-8". This encoding will be used to process the file.

 

This says to me that if you run the same program (put an "if _n_ = 5 then stop;" line after "list;" to prevent it reading the whole file), the log should tell you what the byte-order mark in the file is.

 

Tom

 

Autotelic
Obsidian | Level 7
NOTE: A byte-order mark in the file "file_path_name" (for fileref "#LN00007") indicates that the data is encoded in "utf-16le".  This 
      encoding will be used to process the file.

Does this help?

TomKari
Onyx | Level 15

I'm going to assume that this means we can take all the speculation about whether the file is encoded with something different off the table, and assume that the file is UTF-16.

 

It looks to me like there are some invalid characters (characters that are not defined in UTF-16) at various places in the file. These characters are somehow interfering with your CRLF end of line indications, and also causing the transcoding error messages.

 

I'm not a specialist on encoding, so I'm not sure where to go from here. Is it possible to investigate the process that creates the file, to somehow ensure that everything in it is valid UTF-16?

 

Tom

Autotelic
Obsidian | Level 7

 Is it possible to investigate the process that creates the file, to somehow ensure that everything in it is valid UTF-16?

I think the problem might be here. This is the only file where this happens, out of a few other similar files. It happens even after I extract it from our platform again.
In fact, these exact same lines appear in other files, and they run fine.

I think the problem is with the last 0 in the line before the missing one.

 

Having said this, I can import the file in Pandas with UTF-16 encoding. I don't understand why it would work in Pandas and not in SAS.

Furthermore, I can't find any weird characters in the file.

 

TomKari
Onyx | Level 15

I don't know Pandas, so these comments are pure speculation. I wonder if it's possible that Pandas processes invalid characters for the file encoding differently, changing or dropping them, while SAS throws an error.

 

The challenge in finding the weird characters is figuring out how to examine them! I've never had to do this, so I don't have any suggestions.

Tom
Super User Tom
Super User

I suspect that you have had some transcoding error in generating the file.

Here is a simple program that will get you the hex code of each distinct character in the file, count the number of times it appears, and find its first and last location. I had it count the BOM character as position 0.

Hopefully the problem character will be one that appears rarely.

data test ;
  infile utf16 recfm=n encoding='any'; * read raw bytes, no transcoding ;
  length position 8 utf16 $2 utf16hex $4 ;
  input utf16 $char2. @@ ;  * read two bytes at a time ;
  utf16hex=put(utf16,$hex4.);  * hex code of the 2-byte unit ;
  output;
  position+1;  * increment after OUTPUT so the BOM is position 0 ;
run;
proc summary data=test nway ;
  class utf16hex ;
  var position ;
  output out=summary (drop=_type_) min=first_position max=last_position ;
run;
proc sort data=summary ;
  by _freq_ ;
run;
proc print;
run;

If you figure out for example that character '9900'x is the one that is causing the issue then replace it with something else.

For example you could use a step like this to copy the file character by character and replace the offending character with a space.

data _null_ ;
  infile utf16 recfm=n ;
  input utf16 $char2. @@ ;
  if utf16 ='9900'x then utf16='2000'x ;

  file newutf16 recfm=f lrecl=2 ;
  put utf16 $char2. @;
run;
Autotelic
Obsidian | Level 7

Hi, Tom.

 

Here are the characters that come up only once.

[screenshot: list of the characters that appear only once]

 

I used the converter https://r12a.github.io/apps/conversion/ to try to see what they look like and I can't find them anywhere near where the error occurs. Can't find them if I use the search feature either.

Tom
Super User Tom
Super User

You probably need to isolate to 4 or 5 lines around where the issue is occurring in your original file.

Then test that small sample and see if the error still happens when reading it normally.  If so then you can examine that short file to figure out if there is some specific character or sequence of characters that is causing the issue.

You could try making the subset using your Notepad++ editor. Just make sure to save the subset using the same encoding, or else Notepad++ might fix the problem characters for you. (But then again, if it does, you might just use Notepad++ to resave the original file and fix the issue that way!)

If you cannot get Notepad++ to save the lines, or the subset does not recreate the issue, then perhaps you can use code like below to copy specific lines.

 

So this program will copy lines 1, 4, 5, and 6 only. Just change the condition in the IF statement to select lines near where your original DATA step is seeing the issue. Make sure to copy line one in this data step to retain the BOM at the beginning of the file; otherwise you will need to tell SAS what encoding to use when you try reading it. It assumes that the file is using UTF-16LE, so it looks for '0A00'x following '0D00'x to indicate that a new line is starting.

data _null_;
  retain line 1;
  infile utf16 recfm=n ;
  file copy recfm=f lrecl=2;
  input ch $char2. @@;
  if line = 1 or 4<= line <=6 then put ch $char2. @ ;
  if ch='0A00'x and lag(ch)='0D00'x then line+1;
run;

With this little file of only a few lines you can try running your import program and see if it also throws an error. If it does then dump the whole file using a data step like this to see all of the characters without SAS trying to transcode them.

data _null_;
  infile copy recfm=f lrecl=100 encoding='any';
  input;
  list;
run;
