georgemeng
Calcite | Level 5

When reading in a UTF-8 file with Unicode in it as shown below:

georgemeng_0-1592684876247.png

It took me some time to find out that the Unicode character takes up 3 bytes. So, in order to read the file correctly, I have to change the file to:

 

georgemeng_2-1592685033888.png

This is OK for a small file, but for a big file with a lot of Unicode characters, it is just not practical.

 

My question is: when you have Unicode characters in the file, is there a way to let SAS process them just like normal ASCII characters? I mean without having to manually update the text file while keeping in mind that a Unicode character can take 3 bytes?

 

I have attached the code and the text file. Please let me know your thoughts.

 

Thank you!

 

George

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User

You cannot do it with the INFILE/INPUT statements.  Instead use the KSUBSTR() function to extract the number of CHARACTERS you want from the automatic _INFILE_ variable.  Make sure your lines are not longer than 32K bytes.
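
A minimal sketch of this approach, assuming a UTF-8 session and the same file path and column layout as the INPUT statement posted elsewhere in this thread. The helper variable `line`, its 200-byte length, and the 36-byte lengths for the character variables are assumptions made for illustration, not part of the original code:

```sas
data attractions;
	/* Path, options and column layout are taken from the example in this thread */
	infile '~/test/tour13 - Notworking.txt' truncover encoding='utf-8';

	/* Assumption: records fit in 200 bytes; 9 characters can need up to 36 bytes in UTF-8 */
	length line $ 200 City TourGuide $ 36;

	input;            /* read the record into _INFILE_ without parsing it */
	line = _infile_;  /* fixed-length, blank-padded copy so every position exists */

	/* KSUBSTR counts characters, not bytes */
	City            = ksubstr(line, 1, 9);
	Museums         = input(ksubstr(line, 11, 1), ?? 1.);
	Galleries       = input(ksubstr(line, 13, 1), ?? 1.);
	Other           = input(ksubstr(line, 15, 1), ?? 1.);
	TourGuide       = ksubstr(line, 17, 9);
	YearsExperience = input(ksubstr(line, 26, 1), ?? 1.);

	drop line;
run;
```

Because the positions are counted in characters, a multi-byte character in City or TourGuide should no longer shift the fields that follow it. The `??` modifier only suppresses log notes when a numeric field is blank or invalid.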

4 REPLIES
Patrick
Opal | Level 21

Add the following to your INFILE statement: encoding='utf-8'

This lets me read your "notworking" text file.

options pagesize=60 linesize=80 pageno=1 nodate;

data attractions;
	infile '~/test/tour13 - Notworking.txt' truncover encoding='utf-8';
	input City $ 1-9 Museums 11 Galleries 13 Other 15 TourGuide $ 17-25 YearsExperience 26;
run;

proc print data=attractions;
	title 'Data Set MYLIB.ATTRACTIONS';
run;

 

The next question is whether your SAS session uses a single-byte or a multi-byte encoding. Run the following:

proc options option=encoding;
run;

In my environment the session is single-byte, and not every UTF-8 encoded character can be mapped to a single byte. For me this leads to a warning in the SAS log:

WARNING: A character that could not be transcoded has been replaced in record 1.

And to a garbled character:

Capture.JPG

This might not be a problem for you if your SAS session is multi-byte (e.g. UTF-8).
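
As a rough check of both points, the sketch below writes the session encoding to the log and compares the byte length with the character length of one non-ASCII character. The euro sign and the variable names `ch`, `bytes` and `chars` are chosen purely for illustration; they are not taken from the attached file:

```sas
/* Write the session encoding to the log */
%put Session encoding: %sysfunc(getoption(encoding));

data _null_;
	length ch $ 8;
	ch = unicode('\u20AC');  /* euro sign, converted to the session encoding */
	bytes = length(ch);      /* LENGTH counts bytes */
	chars = klength(ch);     /* KLENGTH counts characters */
	put bytes= chars=;
run;
```

In a UTF-8 session this should report 3 bytes for 1 character, which is why column positions in an INPUT statement (counted in bytes) drift after such a character. In a single-byte session the result depends on whether the character exists in that encoding.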

 

 

georgemeng
Calcite | Level 5

Thank you very much for spending the time over the weekend!

In my environment, I added encoding="utf-8" and also checked the session option. The session encoding is UTF-8 as well, but YearsExperience is still missing for the first record.

 

Your result correctly shows the number 2. Is it possible that this is because the three-byte Unicode character was replaced by a single-byte "?" in your session?

A character that could not be transcoded has been replaced in record 1

I just wonder why it worked in your environment but not in mine.

I am really new to SAS, so please bear with me.

 

My environment:

SAS University Edition on Windows 10 through VirtualBox.

 

Thanks again!

 

George

georgemeng
Calcite | Level 5

Thank you very much. This saves me from spending more time on the INFILE/INPUT approach.

I will explore _INFILE_ to read the file content in.

 

Have a great weekend!

 

George
