I've searched the communities for solutions to this, or at least explanations, but found none.
Any insights would be helpful.
Here is example code, to illustrate the problems.
Objective: Read a UTF-8 encoded page; parse it to sas variables; and store as sas7bdat ... without corrupting UTF-8 chars.
Example code.
%let w_encoding = UTF-8;
%let r_encoding = UTF-8;

filename WRITE './unitslab_conversions.txt' encoding="&W_ENCODING";
filename READ './unitslab_conversions.txt' encoding="&R_ENCODING";
libname data './data' inencoding="&R_ENCODING" outencoding="&W_ENCODING";

proc http
  method="GET"
  url="https://unitslab.com/"
  out=WRITE ;
run;

data data.lbtests;
  infile READ length=len lrecl=32767;
  input line $varying32767. len;
  line = strip(line);
  if prxmatch('/^<li>.+\/node/i', line);
  if prxmatch('/(microglobulin|cancer|beta|kappa|mass|mullerian)/i', line);
  lbtest = strip(scan(line,5,'<>'));
  node = strip(scan(line,2,'"'));

  *--- Paste in strings from the UTF-8 page
       NB - NONE OF THESE MATCH ;
  if index(lbtest, 'anti-Mullerian')
    or index(lbtest, 'Beta 2-microglobulin (ß2-M)')
    or index(lbtest, 'Free ß-subunit of human chorionic gonadotropin (free ßhCG)')
    or index(lbtest, 'Kappa (κ)')
    then putlog 'INFO: FOUND EXPECTED string match: ' lbtest=;

  *--- https://www.w3schools.com/charsets/ref_html_utf8.asp
       NB - LIMITED CHAR RANGE ALSO MATCH TRANSCODE (CORRUPTED) chars above \x{2000} ;
  if prxmatch('/[\x{03b2}\x{052f}]/', line)
    then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 char: ' lbtest=;

  keep lbtest node;
run;
Problems with results:
1 - SAS log - Notice the corrupted UTF-8 chars - I've bolded some UTF-8 hyphens and greek chars:
lbtest=alpha-1‑microglobulin node=/node/89
lbtest=antiâ€Mullerian hormone (AMH) node=/node/155
lbtest=beta - CrossLaps - Degradation products of type I collagen node=/node/164
lbtest=Beta 2â€microglobulin (β2â€M) node=/node/145
lbtest=beta-Hydroxybutyric acid node=/node/225
lbtest=CA 125 (Cancer Antigen 125) node=/node/104
lbtest=CA 15-3 (Cancer Antigen 15-3) node=/node/105
lbtest=CA 72-4 (Antigène de cancer 72-4) node=/node/107
lbtest=CKâ€MB mass - the MB isoenzyme of creatine kinase (quantitative determination) node=/node/157
lbtest=Kappa (κ) light chain node=/node/150
2 - Similar in the resulting SAS7BDAT, despite encoding being UTF-8, but different corruption, appearing to eat up more char bytes:
3 - Similar result viewing SAS7BDAT outside a SAS session - UNIVERSAL VIEWER, but still different corruption of UTF-8 chars:
Thanks for any insights, tips or fixes to this code that can avoid this UTF-8 character corruption.
GG
I might be wrong but I don't think what you are showing means there is corruption.
It might be that the data is fine but the viewers don't support UTF8.
What's the encoding of your SAS environment?
Can you see the txt files in a proper viewer such as notepad++?
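One quick way to answer the session-encoding question is to ask SAS directly. The following is a minimal sketch using the ENCODING system option and the SYSENCODING automatic macro variable; the exact log text will vary by installation.

```sas
/* Report the encoding of the current SAS session. */
proc options option=encoding;
run;

/* The same value is available as an automatic macro variable. */
%put NOTE: Session encoding is &sysencoding;
```

If the value reported is a single-byte encoding rather than UTF-8, multi-byte characters read from a UTF-8 file may be displayed as several separate characters in the log.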
Thanks, Chris - Unfortunately, it is not only a display problem.
Bottom line: SAS corrupts characters in the HTML5 UTF-8 range above the Greek/Cyrillic chars.
Reference for char hex values: w3schools page HTML UTF-8 encoding.
Notepad++ preserves but fails to display HTML UTF-8 chars in the range \x{2000}-\x{27bf}
By comparison, SAS 9.04 corrupts chars at least above \x{052F}.
I've updated the code above to demonstrate this.
The resulting log file snippet follows. SAS and Notepad++ have similar display problems:
NOTE: The infile READ is:
Filename=unitslab_conversions.txt,
RECFM=V,LRECL=131068,File Size (bytes)=42931,
Last Modified=23Feb2020:10:51:35,
Create Time=21Feb2020:08:29:19
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=alpha-1‑microglobulin
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=antiâ€Mullerian hormone (AMH)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=beta - CrossLaps - Degradation products of type I collagen
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=Beta 2â€microglobulin (β2â€M)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=beta-Hydroxybutyric acid
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=CA 125 (Cancer Antigen 125)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=CA 15-3 (Cancer Antigen 15-3)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=CA 72-4 (Antigène de cancer 72-4)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=CKâ€MB mass - the MB isoenzyme of creatine kinase (quantitative determination)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=Kappa (κ) light chain
Note that the prxmatch not only matches on the Greek letters (small beta and kappa) but also on the hyphens \x{2010} and \x{2011}.
SAS has corrupted these UTF-8 chars into something in the Greek/Cyrillic range.
You are raising two different issues here:
1. Matching strings
UTF-8 is I18N Level 2. Not all SAS functions support this. See here (it's a Viya link but is true for SAS too) :
2. Displaying strings.
You haven't said what encoding your SAS session uses.
3. How do the contents of filerefs READ and WRITE compare in terms of encoded values?
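One way to compare encoded values rather than displayed glyphs is to print the stored bytes in hex, because the log may render the same bytes differently depending on the session encoding. The sketch below assumes the READ fileref and the `<li>` filter from the original post; the `lbtest` length of 200 and the `microglobulin` filter are arbitrary choices for illustration.

```sas
/* Print selected values both as text and as hex bytes, so the stored
   bytes can be checked independently of how the log displays them. */
data _null_;
  length lbtest $200;
  infile READ length=len lrecl=32767;
  input line $varying32767. len;
  line = strip(line);
  if prxmatch('/^<li>.+\/node/i', line);
  lbtest = strip(scan(line, 5, '<>'));
  /* $HEX120. shows the first 60 bytes of the value. */
  if find(lbtest, 'microglobulin', 'i') then
    putlog lbtest= / lbtest=$hex120.;
run;
```

For reference, in UTF-8 the Greek small beta (U+03B2) is stored as the bytes CE B2 and the non-breaking hyphen (U+2011) as E2 80 91, so those are the sequences to look for in the hex output.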
Thanks for that reference, Chris - I'll have to review (Internationalization).
I suspect that this will not explain why the approach in my code works for some UTF-8 range beyond ASCII, but not the entire HTML5 UTF-8 range.
Your point (3) is also a clever test - I'll take a look at this, as well.
I've set my session encoding at start-up: -encoding ASCIIANY.
I tried forcing (SBCS) sessions to -encoding "UTF-8", which according to documentation should be valid, although note that utf-8 is at best an after-thought on that page (same for UNIX). But SAS fails to launch with that setting ("invalid" encoding value).
I tried various combinations of DBCS, ENCODING, DBCSTYPE, DBCSLANG - all result in some sort of failure message, failure to start session, so no further testing in this direction is possible in my work environment.
Much appreciated!
See here to start your session:
http://support.sas.com/kb/51/586.html
Note that UTF-8 is not DBCS (double-byte), it is MBCS (multi-byte).
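The multi-byte point can be seen by comparing a byte count with a character count. This is a small sketch, assuming a UTF-8 session; the variable names are made up for illustration.

```sas
/* 'CEB2'x is the UTF-8 encoding of the Greek small letter beta (U+03B2). */
data _null_;
  beta = 'CEB2'x;
  n_bytes = length(beta);   /* LENGTH counts bytes */
  n_chars = klength(beta);  /* KLENGTH counts characters in the session encoding */
  putlog n_bytes= n_chars=;
run;
```

In a UTF-8 session the two counts should differ (two bytes, one character), whereas in a single-byte session each byte is treated as its own character.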
Again, Chris: Much appreciated, all of your nuggets of insight!
So all is working now?
Unfortunately, I have not been able to fully resolve how SAS is handling HTML5 UTF-8 chars.
Step by step - this is what works, and this is what I cannot get past in a session with -ENCODING ASCIIANY:
1 - As Chris suggested, SAS is preserving HTML UTF-8 chars in the read/write process.
In the following snippet (variation on above code) files "unitslab_conversions.txt" and "unitslab_conv2.txt" match exactly.
Note: TERMSTR=LF overrides default behaviour in my Win environment, to preserve UNIX line feeds (LF)
%let w_encoding = UTF-8;
%let r_encoding = UTF-8;

filename WRITE './unitslab_conversions.txt' encoding="&W_ENCODING";
filename READ './unitslab_conversions.txt' encoding="&R_ENCODING";
filename WRITE2 './unitslab_conv2.txt' encoding="&W_ENCODING" TERMSTR=LF ;
libname data './data' inencoding="&R_ENCODING" outencoding="&W_ENCODING";

proc http
  method = 'GET'
  url = 'https://unitslab.com/'
  out = WRITE ;
run;

data _null_;
  infile READ length=len lrecl=32767;
  input;
  file WRITE2 lrecl=32767;
  put _infile_;
run;
2 - However, within the string-searching / matching data step above, even the K-functions do not find the HTML5 UTF-8 chars in those text files. They DO find what seem to be corrupted chars. E.g., instead of the HTML UTF-8 HYPHEN char \x{2011}, the K-functions find the corrupted replacement string 'E28091'x, which matches the corrupted display chars "‑".
These searches find none of the expected characters:
*--- https://www.w3schools.com/charsets/ref_html_utf8.asp
THESE DO NOT MATCH ;
if kindex(line, '03B2'x)
then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 BETA char: ' lbtest=;
if kindex(line, '03BA'x)
then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 KAPPA char: ' lbtest=;
if kindex(line, '2010'x)
then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 HYPHEN char: ' lbtest=;
if kindex(line, '2011'x)
then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 NON-BREAKING HYPHEN char: ' lbtest=;
These searches do find UNEXPECTED transcoded/corrupted chars:
*--- https://www.w3schools.com/charsets/ref_html_utf8.asp
THESE DO MATCH, BUT SHOULD NOT - TRANSCODED CHARS, NOT THE ORIGINAL CHARS ;
if kindex(line, 'E2'x)
then putlog 'INFO: FOUND UNEXPECTED HTML5 UTF-8 â char: ' lbtest=;
if kindex(line, '80'x)
then putlog 'INFO: FOUND UNEXPECTED HTML5 UTF-8 € char: ' lbtest=;
if kindex(line, '91'x)
then putlog 'INFO: FOUND UNEXPECTED HTML5 UTF-8 ‘ char: ' lbtest=;
I am unable to resolve or make sense of this. It would be quite nice if this were somehow more transparent.
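A note that may help make sense of the results above: a SAS hex literal is a sequence of raw bytes, so '03B2'x is the two bytes 03 and B2, not the code point U+03B2. In UTF-8 the code points searched for above are stored as multi-byte sequences: CE B2 (beta, U+03B2), CE BA (kappa, U+03BA), E2 80 90 (hyphen, U+2010) and E2 80 91 (non-breaking hyphen, U+2011). The following sketch runs the same kind of search against those byte sequences. The test value in `line` is built here purely for illustration, and a UTF-8 session is assumed.

```sas
data _null_;
  length line $80;
  /* "Beta 2<U+2011>microglobulin (<U+03B2>2<U+2011>M)" built from UTF-8 bytes */
  line = 'Beta 2' || 'E28091'x || 'microglobulin (' || 'CEB2'x
         || '2' || 'E28091'x || 'M)';

  /* Search for the UTF-8 byte sequence of each code point. */
  if kindex(line, 'CEB2'x)   > 0 then putlog 'INFO: found beta (U+03B2)';
  if kindex(line, 'CEBA'x)   > 0 then putlog 'INFO: found kappa (U+03BA)';
  if kindex(line, 'E28090'x) > 0 then putlog 'INFO: found hyphen (U+2010)';
  if kindex(line, 'E28091'x) > 0 then putlog 'INFO: found non-breaking hyphen (U+2011)';
run;
```

With this test value, the beta and non-breaking hyphen searches should report a match and the other two should not.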
And just to add that pointing to the alternative UTF-8 config file (-ENCODING UTF-8) does not help.
That was the guidance here: http://support.sas.com/kb/51/586.html
This discussion might interest you
https://www.pharmasug.org/proceedings/2018/BB/PharmaSUG-2018-BB08.pdf
Well, I'm stubborn, and I don't like to give up, so in case it helps anyone else, here is my summary of successfully working with UTF-8 characters.
Much of what ChrisNZ suggested is very helpful, and contributed to finally sorting this out.
Steps to check when working with UTF-8 in SAS 9.4 on Windows Server 2016 / Win10:
Good luck 😉
Excellent summary.
About: 4. Check that the config file sets: -ENCODING UTF-8
This is done by running proc options group=languagecontrol; run;
As promised, above, buggy or at least unreliable K-functions.
KCOMPRESS() should have a similar interface, including modifiers, as COMPRESS()
Unfortunately, it does not actually accept modifiers:
data _null_;
  length str noletters nonumbers $50;
  do str = 'label123', 'α1β2γ3δ', 'ceb131ceb232ceb333ceb434'x;
    putlog 'ORIGINAL: ' str= str=$hex24. ;
    nonumbers = strip(kcompress(str, ,'d'));
    putlog ' No numbers: ' nonumbers= str=$hex24. ;
    noletters = strip(kcompress(str, ,'dk'));
    putlog ' No letters: ' noletters= str=$hex24. ;
  end;
run;
This throws errors:
1    nonumbers = strip(kcompress(str, ,'d'));
                       ---------
                       72
ERROR 72-185: The KCOMPRESS function call has too many arguments.
Brute force still works:
data _null_;
  length str noletters nonumbers $50;
  do str = 'label123', 'α1β2γ3δ', 'ceb131ceb232ceb333ceb434'x;
    putlog 'ORIGINAL: ' str= str=$hex24. ;
    nonumbers = strip(kcompress(str, '0123456789'));
    putlog ' No numbers: ' nonumbers= str=$hex24. ;
    noletters = strip(kcompress(str, 'abcdefghijklmnopqrstuvwxyz'));
    putlog ' No letters: ' noletters= str=$hex24. ;
  end;
run;
Although even in a UTF-8 session, the log cannot display UTF-8 characters, which is rather unhelpful.
ORIGINAL: str=label123 str=6C6162656C31323320202020
 No numbers: nonumbers=label str=6C6162656C31323320202020
 No letters: noletters=123 str=6C6162656C31323320202020
ORIGINAL: str=α1β2γ3δ str=CEB131CEB232CEB333CEB420
 No numbers: nonumbers=αβγδ str=CEB131CEB232CEB333CEB420
 No letters: noletters=α1β2γ3δ str=CEB131CEB232CEB333CEB420
ORIGINAL: str=α1β2γ3δ4 str=CEB131CEB232CEB333CEB434
 No numbers: nonumbers=αβγδ str=CEB131CEB232CEB333CEB434
 No letters: noletters=α1β2γ3δ4 str=CEB131CEB232CEB333CEB434
Nothing buggy or unreliable about the K-functions.
They support fewer features, that's all.
And it's easy to understand why.
For example what's a number or a letter in UTF-8?
Is δ a letter? And ﻌ ? What about consonant clusters used in Korean, such as ㄵ ?
Is ٣ a number? or ፵ ?
What about Hebrew numbers? Fifteen can be ט״ו or י״ה while the second of these 3 characters is not a number.
It gets really complicated really quickly when you want to support all the writing systems in the world.
This explains why for now, some options are left out for the multi-byte functions.
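When only the ASCII digits 0-9 need to be removed, one possible workaround is the byte-oriented COMPRESS function with the 'd' modifier. This is a sketch under a stated assumption rather than a documented multi-byte feature: it relies on the fact that every byte of a multi-byte UTF-8 character is 80 hex or higher, so removing the bytes 30-39 hex cannot split such a character. It does not address the non-ASCII digits mentioned above, and the variable name `nodigits` is made up for illustration.

```sas
data _null_;
  length str nodigits $50;
  /* The second value is alpha 1 beta 2 gamma 3 delta 4 as UTF-8 bytes. */
  do str = 'label123', 'CEB131CEB232CEB333CEB434'x;
    nodigits = compress(str, , 'd');   /* removes ASCII digits 0-9 only */
    putlog str= nodigits= nodigits=$hex24.;
  end;
run;
```

In the hex output the Greek letters' byte pairs (CE B1, CE B2, CE B3, CE B4) should remain intact, with only the digit bytes removed.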