When reading in a UTF-8 file with Unicode characters in it as shown below:
It took me some time to find out that the Unicode character takes 3 bytes. So in order to read the file correctly, I have to update the file to:
This is OK for a small file, but for a big file with a lot of Unicode characters, it is just not practical.
My question is: when there are Unicode characters in the file, is there a way to let SAS process them just like normal ASCII characters? I mean without having to manually update the text file with Unicode = 3 bytes in mind.
I have attached the code and text file, please let me know your thoughts.
Thank you!
George
Add the following to your INFILE statement: encoding='utf-8'
This lets me read your "notworking" text file.
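options pagesize=60 linesize=80 pageno=1 nodate;

/* Read the fixed-column file; ENCODING= tells SAS the file is UTF-8 */
data attractions;
   infile '~/test/tour13 - Notworking.txt' truncover encoding='utf-8';
   input City $ 1-9 Museums 11 Galleries 13 Other 15
         TourGuide $ 17-25 YearsExperience 26;
run;

/* The data set is created in WORK, so the title says WORK rather than MYLIB */
proc print data=attractions;
   title 'Data Set WORK.ATTRACTIONS';
run;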
The next question is whether your SAS session runs with a single-byte or a multi-byte encoding. Run the following to check:
proc options option=encoding;
run;
In my environment the session is single-byte, and not every UTF-8 encoded character can be mapped to a single byte. For me this leads to a warning in the SAS log:
WARNING: A character that could not be transcoded has been replaced in record 1.
And to a garbled character:
This might not be a problem for you if your SAS session is multi-byte (i.e. UTF-8).
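To see whether byte counts and character counts differ on a given record, a small diagnostic step along these lines may help. It is only a sketch: it reuses the file path from the code above, and the variable names bytes and chars are made up for illustration. LENGTH() counts bytes and KLENGTH() counts characters, so the two should differ only where a record holds multi-byte characters in a multi-byte session.

```sas
/* Show the session encoding in the log */
%put Session encoding: %sysfunc(getoption(encoding));

/* Compare bytes and characters per record (nothing is written to a data set) */
data _null_;
   infile '~/test/tour13 - Notworking.txt' truncover encoding='utf-8';
   input;                       /* load the record into _INFILE_ only     */
   bytes = length(_infile_);    /* length in bytes                        */
   chars = klength(_infile_);   /* length in characters                   */
   put _n_= bytes= chars=;      /* one log line per record                */
run;
```

If bytes is larger than chars for a record, byte-based column positions after the multi-byte character on that record will be shifted.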
Thank you very much for spending the time over the weekend!
In my environment, I added encoding="utf-8" and also checked the session option. The session encoding is also UTF-8, but YearsExperience is still missing for the first record.
Your result correctly shows the number 2. Is it possible that this is because the three-byte Unicode character was replaced by a single-byte "?"?
A character that could not be transcoded has been replaced in record 1
I just wonder why it worked in your environment but not in mine.
I am really new to SAS, so please bear with me.
My environment:
SAS University Edition on Windows 10 through VirtualBox.
Thanks again!
George
You cannot do it with the INFILE/INPUT statements. Instead use the KSUBSTR() function to extract the number of CHARACTERS you want from the automatic _INFILE_ variable. Make sure your lines are not longer than 32K bytes.
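A minimal sketch of that approach, assuming the column layout and file path from the INPUT statement posted earlier in this thread. The helper variable line, its length of 200 bytes, and the 36-byte lengths for the character variables (9 characters at up to 4 bytes each in UTF-8) are assumptions for illustration, not something taken from the attached file.

```sas
data attractions;
   infile '~/test/tour13 - Notworking.txt' truncover encoding='utf-8';
   input;                              /* fill _INFILE_ without parsing   */

   /* Assumed sizes: lengths are in BYTES, so leave room for multi-byte chars */
   length line $ 200 City TourGuide $ 36;
   line = _infile_;                    /* blank-padded copy, so short     */
                                       /* records do not break KSUBSTR    */

   /* KSUBSTR counts characters, not bytes */
   City            = ksubstr(line, 1, 9);
   Museums         = input(ksubstr(line, 11, 1), ?? 1.);
   Galleries       = input(ksubstr(line, 13, 1), ?? 1.);
   Other           = input(ksubstr(line, 15, 1), ?? 1.);
   TourGuide       = ksubstr(line, 17, 9);
   YearsExperience = input(ksubstr(line, 26, 1), ?? 1.);

   drop line;
run;
```

The ?? modifier only suppresses invalid-data notes when a position is blank; drop it if you would rather see those notes in the log.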
Thank you very much, this saves me from spending more time on INFILE/INPUT.
Will explore _INFILE_ to get the file content in.
Have a great weekend!
George