georgemeng
Calcite | Level 5

When reading in a UTF-8 file with Unicode in it as shown below:

georgemeng_0-1592684876247.png

It took me some time to find out that the Unicode character takes up 3 bytes. So, in order to read the file correctly, I had to change the file to:

 

georgemeng_2-1592685033888.png

This is OK for a small file, but for a big file with a lot of Unicode characters it is just not practical.

 

My question: when a file contains Unicode characters, is there a way to have SAS treat each of them like a normal ASCII character? I mean without having to manually edit the text file to account for each Unicode character taking 3 bytes.

 

I have attached the code and the text file. Please let me know your thoughts.

 

Thank you!

 

George

4 REPLIES
Patrick
Opal | Level 21

Add the following to your INFILE statement: encoding='utf-8'

This lets me read your "notworking" text file.

options pagesize=60 linesize=80 pageno=1 nodate;

/* Read the file as UTF-8; TRUNCOVER keeps short records from reading into the next line. */
data attractions;
	infile '~/test/tour13 - Notworking.txt' truncover encoding='utf-8';
	input City $ 1-9 Museums 11 Galleries 13 Other 15 TourGuide $ 17-25 YearsExperience 26;
run;

proc print data=attractions;
	title 'Data Set MYLIB.ATTRACTIONS';
run;

 

The next question is whether your SAS session uses a single-byte or a multi-byte encoding. Run the following to check:

proc options option=encoding;
run;

In my environment the session is single-byte, and not all UTF-8 encoded characters can be mapped to a single byte. For me, this leads to a warning in the SAS log:

WARNING: A character that could not be transcoded has been replaced in record 1.

And to a garbled character:

Capture.JPG

This might not be a problem for you if your SAS session is multi-byte (i.e. UTF-8).
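To make the byte-versus-character distinction concrete, here is a minimal sketch that prints the session encoding and compares the byte length of a string with its character length. The sample value (the euro sign, written as the hex literal 'E282AC'x) is only an illustrative assumption and is not taken from the attached file.

```sas
/* Write the session encoding to the log (SYSENCODING is an automatic macro variable). */
%put NOTE: Session encoding is &sysencoding;

data _null_;
    length s $ 12;
    /* Hypothetical sample: the UTF-8 bytes of the euro sign, given in hex so that
       the result does not depend on the encoding of the program file. */
    s = 'E282AC'x;
    bytes = length(s);   /* LENGTH counts bytes */
    chars = klength(s);  /* KLENGTH counts characters in the session encoding */
    put bytes= chars=;
run;
```

In a UTF-8 session the two numbers should differ for this value, whereas in a single-byte session each byte is treated as its own character. That difference is why byte-based column positions can shift after a multi-byte character.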

 

 

georgemeng
Calcite | Level 5

Thank you very much for spending the time over the weekend!

In my environment, I added encoding="utf-8" and also checked the session option. The session encoding is also UTF-8, but YearsExperience is still missing for the first record.

 

Your result correctly shows the number 2. Is it possible that this is because the three-byte Unicode character was replaced by a single-byte "?"?

A character that could not be transcoded has been replaced in record 1

I just wonder why it worked in your environment but not in mine.

I am really new to SAS, so please bear with me.

 

My environment:

SAS University Edition on Windows 10 through VirtualBox.

 

Thanks again!

 

George

Tom
Super User (accepted solution)

You cannot do it with the INFILE/INPUT statements, because the column positions on the INPUT statement count bytes, not characters. Instead, use the KSUBSTR() function to extract the number of CHARACTERS you want from the automatic _INFILE_ variable. Make sure your lines are not longer than 32K bytes.
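A minimal sketch of that approach, assuming a UTF-8 session and reusing the file path and column layout from the code posted earlier in this thread. The helper variable `line`, its 200-byte width, and the $ 36 lengths are assumptions chosen for illustration. Copying _INFILE_ into a fixed-length variable pads short records with blanks, so the later KSUBSTR() calls do not reach past the end of the string.

```sas
data attractions;
    infile '~/test/tour13 - Notworking.txt' truncover encoding='utf-8';
    input;                        /* load the record into _INFILE_ only */

    /* Assumed widths: 200 bytes per record, and up to 4 bytes for each
       of the 9 characters in the two character fields. */
    length line $ 200 City TourGuide $ 36;
    line = _infile_;              /* blank-padded copy of the record */

    /* Positions and lengths below are counted in characters, not bytes. */
    City            = ksubstr(line, 1, 9);
    Museums         = input(ksubstr(line, 11, 1), 1.);
    Galleries       = input(ksubstr(line, 13, 1), 1.);
    Other           = input(ksubstr(line, 15, 1), 1.);
    TourGuide       = ksubstr(line, 17, 9);
    YearsExperience = input(ksubstr(line, 26, 1), 1.);

    drop line;
run;
```

If the real records are longer than 200 bytes, the width of `line` would need to be increased, up to the 32K limit mentioned above.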

georgemeng
Calcite | Level 5

Thank you very much. This saves me from spending more time on it.

I will explore _INFILE_ to read the file content.

 

Have a great weekend!

 

George
