When reading in a UTF-8 file with Unicode in it as shown below:
It took me some time to find out that the Unicode character takes up 3 bytes. So in order to read the file correctly, I have to update the file to:
This is OK for a small file, but for a big file with a lot of Unicode characters it is just not practical.
My question is: when a file contains Unicode characters, is there a way to let SAS process them just like normal ASCII characters? I mean without having to manually update the text file while keeping in mind that each Unicode character is 3 bytes.
I have attached the code and text file, please let me know your thoughts.
Thank you!
George
Add the following to your INFILE statement: encoding='utf-8'
This lets me read your "notworking" text file.
options pagesize=60 linesize=80 pageno=1 nodate;

data attractions;
    /* encoding= tells SAS the external file is UTF-8 */
    infile '~/test/tour13 - Notworking.txt' truncover encoding='utf-8';
    input City $ 1-9 Museums 11 Galleries 13 Other 15
          TourGuide $ 17-25 YearsExperience 26;
run;

proc print data=attractions;
    title 'Data Set MYLIB.ATTRACTIONS';
run;
The next question is whether your SAS session runs with a single-byte or a multibyte encoding. Run the following to check:
proc options option=encoding;
run;
In my environment the session is single-byte, and not every UTF-8 encoded character can be mapped to a single byte. For me this leads to a warning in the SAS log:
WARNING: A character that could not be transcoded has been replaced in record 1.
And to a garbled character:
Might not be a problem for you if your SAS session is multibyte (i.e. UTF-8).
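To see the difference between bytes and characters in your own session, a minimal sketch like the one below may help. It builds a euro sign (three bytes in UTF-8) from a Unicode escape, so no non-ASCII literal is needed in the program itself, and then compares LENGTH (which counts bytes) with KLENGTH (which counts characters). The variable names are only illustrative, and what gets printed depends on your session encoding.

```sas
data _null_;
    length s $ 8;
    /* Convert the escape for the euro sign to the session encoding */
    s = unicode('\u20AC');
    bytes = length(s);    /* number of bytes used by the value */
    chars = klength(s);   /* number of characters in the value */
    put bytes= chars=;
run;
```

If the two numbers differ, the session is storing that character in more than one byte, which is exactly the situation where byte-based column positions stop lining up.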
Thank you very much for spending the time over the weekend!
In my environment I added encoding="utf-8" and also checked the session option. The session encoding is UTF-8 as well, but YearsExperience is still missing for the first record.
Your result correctly shows the number 2. Is it possible that this is because the three-byte Unicode character was replaced by a single-byte "?"?
A character that could not be transcoded has been replaced in record 1
I just wonder why it worked in your environment but not in mine.
I am really new to SAS, so please bear with me.
My environment:
SAS University Edition on Windows 10 through VirtualBox.
Thanks again!
George
You cannot do it with the INFILE/INPUT statements. Instead use the KSUBSTR() function to extract the number of CHARACTERS you want from the automatic _INFILE_ variable. Make sure your lines are not longer than 32K bytes.
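A rough sketch of that approach, assuming a UTF-8 session and reusing the file name and column layout from the earlier reply (both may differ for your data):

```sas
data attractions;
    infile '~/test/tour13 - Notworking.txt' truncover encoding='utf-8';
    input;   /* read the whole record into the automatic _INFILE_ variable */

    /* 9 characters can need up to 36 bytes in UTF-8 (assumed upper bound) */
    length City TourGuide $ 36;

    /* KSUBSTR positions and lengths are counted in characters, not bytes */
    City            = ksubstr(_infile_, 1, 9);
    Museums         = input(ksubstr(_infile_, 11, 1), ?? 1.);
    Galleries       = input(ksubstr(_infile_, 13, 1), ?? 1.);
    Other           = input(ksubstr(_infile_, 15, 1), ?? 1.);
    TourGuide       = ksubstr(_infile_, 17, 9);
    YearsExperience = input(ksubstr(_infile_, 26, 1), ?? 1.);
run;
```

The ?? modifier only suppresses messages about values that cannot be read as numbers. Records shorter than 26 characters may still produce notes from KSUBSTR, so a real program would probably check KLENGTH(_INFILE_) first.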
Thank you very much; this saves me from spending more time on it.
I will explore _INFILE_ to get the file content in.
Have a great weekend!
George