GGO
Obsidian | Level 7

I've searched the communities for solutions to this, or at least explanations, but found none.

Any insights would be helpful.

 

Here is example code, to illustrate the problems.

 

Objective: read a UTF-8 encoded page, parse it into SAS variables, and store it as a SAS7BDAT ... without corrupting UTF-8 chars.

 

Example code.

%let w_encoding = UTF-8;
%let r_encoding = UTF-8;

filename WRITE './unitslab_conversions.txt' encoding="&W_ENCODING";
filename  READ './unitslab_conversions.txt' encoding="&R_ENCODING";

libname data './data' inencoding="&R_ENCODING" outencoding="&W_ENCODING";

proc http
  method="GET"
  url="https://unitslab.com/"
  out=WRITE ;
run;

data data.lbtests;
  infile READ length=len lrecl=32767;
  input line $varying32767. len;

  line = strip(line);

  if prxmatch('/^<li>.+\/node/i', line);
  if prxmatch('/(microglobulin|cancer|beta|kappa|mass|mullerian)/i', line);

  lbtest = strip(scan(line,5,'<>'));
  node   = strip(scan(line,2,'"'));

*--- Paste in strings from the UTF-8 page 
     NB - NONE OF THESE MATCH ;
  if    index(lbtest, 'anti-Mullerian')
     or index(lbtest, 'Beta 2-microglobulin (ß2-M)')
     or index(lbtest, 'Free ß-subunit of human chorionic gonadotropin (free ßhCG)')
     or index(lbtest, 'Kappa (κ)')
     then putlog 'INFO: FOUND EXPECTED string match: ' lbtest=;

*--- https://www.w3schools.com/charsets/ref_html_utf8.asp 
     NB - LIMITED CHAR RANGE ALSO MATCH TRANSCODE (CORRUPTED) chars above \x{2000} ;
  if prxmatch('/[\x{03b2}\x{052f}]/', line)
     then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 char: ' lbtest=;

  keep lbtest node;
run;

 

Problems with results:

 

1 - SAS log. Notice the corrupted UTF-8 chars (some of the UTF-8 hyphens and Greek letters):

 

lbtest=alpha-1‑microglobulin node=/node/89
lbtest=anti‐Mullerian hormone (AMH) node=/node/155
lbtest=beta - CrossLaps - Degradation products of type I collagen node=/node/164
lbtest=Beta 2‐microglobulin (β2‐M) node=/node/145
lbtest=beta-Hydroxybutyric acid node=/node/225
lbtest=CA 125 (Cancer Antigen 125) node=/node/104
lbtest=CA 15-3 (Cancer Antigen 15-3) node=/node/105
lbtest=CA 72-4 (Antigène de cancer 72-4) node=/node/107
lbtest=CK‐MB mass - the MB isoenzyme of creatine kinase (quantitative determination) node=/node/157
lbtest=Kappa (κ) light chain node=/node/150

 

2 - Similar in the resulting SAS7BDAT, despite the encoding being UTF-8, but with different corruption that appears to eat up more character bytes:

 

sas7bdat.PNG

 

 

3 - Similar result when viewing the SAS7BDAT outside a SAS session (SAS Universal Viewer), but again with different corruption of the UTF-8 chars:

 

uv_sas7bdat.PNG

 

Thanks for any insights, tips or fixes to this code that can avoid this UTF-8 character corruption.

GG

ACCEPTED SOLUTION
ChrisNZ
Tourmaline | Level 20

See here to start your session:

http://support.sas.com/kb/51/586.html

 

Note that UTF-8 is not DBCS (double-byte), it is MBCS (multi-byte).


22 Replies
ChrisNZ
Tourmaline | Level 20

I might be wrong but I don't think what you are showing means there is corruption.

It might be that the data is fine but the viewers don't support UTF-8.

What's the encoding of your SAS environment?

Can you see the txt files correctly in a proper viewer such as Notepad++?

 

GGO
Obsidian | Level 7

Thanks, Chris - Unfortunately, it is not only a display problem.

 

Bottom line: SAS corrupts characters in the HTML5 UTF-8 range above the Greek/Cyrillic chars.

 

Reference for char hex values: w3schools page HTML UTF-8 encoding.

 

Notepad++ preserves but fails to display HTML UTF-8 chars in the range \x{2000}-\x{27bf}

  • According to the W3 HTML UTF-8 page, above, these are all chars after "Cyrillic Supplement"
  • But at least Notepad++ preserves the correct chars, despite display problems - regex [\x{2000}-\x{27bf}]
  • npp-utf-8-display-BAD.png

 

By comparison, SAS 9.04 corrupts chars at least above \x{052F}.

I've updated the code above to demonstrate this.

 

The resulting log file snippet follows. SAS and Notepad++ have similar display problems:

 

NOTE: The infile READ is:
Filename=unitslab_conversions.txt,
RECFM=V,LRECL=131068,File Size (bytes)=42931,
Last Modified=23Feb2020:10:51:35,
Create Time=21Feb2020:08:29:19

INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=alpha-1‑microglobulin
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=anti‐Mullerian hormone (AMH)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=beta - CrossLaps - Degradation products of type I collagen
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=Beta 2‐microglobulin (β2‐M)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=beta-Hydroxybutyric acid
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=CA 125 (Cancer Antigen 125)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=CA 15-3 (Cancer Antigen 15-3)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=CA 72-4 (Antigène de cancer 72-4)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=CK‐MB mass - the MB isoenzyme of creatine kinase (quantitative determination)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=Kappa (κ) light chain

 

Note that the prxmatch not only matches on the Greek letters (small beta and kappa) but also on the hyphens \x{2010} and \x{2011}.

 

SAS has corrupted these UTF-8 chars into something in the Greek/Cyrillic range.

ChrisNZ
Tourmaline | Level 20

You are raising a few different issues here:

 

1. Matching strings

UTF-8 is I18N Level 2, and not all SAS functions support this (a small illustration follows at the end of this reply). See here (it's a Viya link but it is true for SAS too):

https://documentation.sas.com/?docsetId=nlsref&docsetTarget=p1pca7vwjjwucin178l8qddjn0gi.htm&docsetV...

 

2. Displaying strings.

You haven't said what encoding your SAS session uses.

 

3. How do the contents of filerefs READ and WRITE compare in terms of encoded values?
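To make the I18N point concrete, here is a minimal sketch (assuming a UTF-8 session, with the Greek letter pasted as UTF-8): the byte-oriented functions count bytes, while the K-functions count characters.

data _null_;
  s = 'β2-M';              /* the β is 2 bytes (CE B2) in UTF-8        */
  blen = length(s);        /* byte length: 5                           */
  clen = klength(s);       /* character length: 4                      */
  bpos = index(s, '-');    /* byte position of the dash: 4             */
  cpos = kindex(s, '-');   /* character position of the dash: 3        */
  putlog blen= clen= bpos= cpos=;
run;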

GGO
Obsidian | Level 7

Thanks for that reference, Chris - I'll have to review (Internationalization).

 

I suspect that this will not explain why the approach in my code works for some UTF-8 range beyond ASCII, but not the entire HTML5 UTF-8 range.

 

Your point (3) is also a clever test - I'll take a look at this, as well.

 

I've set my session encoding at start-up: -encoding ASCIIANY.

 

I tried forcing the (SBCS) session to -encoding "UTF-8", which according to the documentation should be valid, although note that UTF-8 is at best an afterthought on that page (same for UNIX). But SAS fails to launch with that setting ("invalid" encoding value).

 

I tried various combinations of DBCS, ENCODING, DBCSTYPE, DBCSLANG - all result in some sort of failure message or a failure to start the session, so no further testing in this direction is possible in my work environment.

 

Much appreciated!

ChrisNZ
Tourmaline | Level 20

See here to start your session:

http://support.sas.com/kb/51/586.html

 

Note that UTF-8 is not DBCS (double-byte), it is MBCS (multi-byte).
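For reference, the KB note boils down to starting SAS with the Unicode configuration file rather than the default one. On a typical Windows install, the "SAS 9.4 (Unicode Support)" shortcut target looks roughly like this (the SASHome path varies by installation, so treat it as a sketch):

"C:\Program Files\SASHome\SASFoundation\9.4\sas.exe" -CONFIG "C:\Program Files\SASHome\SASFoundation\9.4\nls\u8\sasv9.cfg"

That config file sets -ENCODING UTF-8 for the whole session.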

GGO
Obsidian | Level 7

Again, Chris: Much appreciated, all of your nuggets of insight!

GGO
Obsidian | Level 7

Unfortunately I have not been able to fully resolve how SAS is handling HTML5 UTF-8 chars.

 

Step by step - this is what works, and this is what I cannot get past in a session with -ENCODING ASCIIANY:

 

1 - As Chris suggested, SAS is preserving HTML UTF-8 chars in the read/write process.

In the following snippet (a variation on the code above), the files "unitslab_conversions.txt" and "unitslab_conv2.txt" match exactly (a byte-level check is sketched after the data step).

Note: TERMSTR=LF overrides the default behaviour in my Windows environment, to preserve the UNIX line feeds (LF).

 

%let w_encoding = UTF-8;
%let r_encoding = UTF-8;

filename WRITE  './unitslab_conversions.txt' encoding="&W_ENCODING";
filename  READ  './unitslab_conversions.txt' encoding="&R_ENCODING";
filename WRITE2 './unitslab_conv2.txt'       encoding="&W_ENCODING" TERMSTR=LF ;

libname data './data' inencoding="&R_ENCODING" outencoding="&W_ENCODING";

proc http
  method = 'GET'
     url = 'https://unitslab.com/'
     out = WRITE ;
run;

data _null_;
  infile READ length=len lrecl=32767;
  input;

  file WRITE2 lrecl=32767; 
  put _infile_;
run;
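One way to confirm the two files really are byte-identical (a sketch, assuming a Windows session with XCMD allowed and the files sitting in the SAS working directory; fc /B is the Windows binary file-compare command):

filename CMP pipe 'fc /B unitslab_conversions.txt unitslab_conv2.txt';

data _null_;
  infile CMP;
  input;
  putlog _infile_;   /* "FC: no differences encountered" when the files are byte-identical */
run;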

2 - However, within the string-searching / matching data step above, even the K-functions do not correctly find HTML5 UTF-8 chars in those text files. They DO find what seem to be corrupted chars. E.g., instead of the HTML UTF-8 non-breaking hyphen \x{2011}, the K-functions find the byte string 'E28091'x, which matches the corrupted display chars "‑" (see the sketch at the end of this post).

 

These searches find none of the expected characters:

*--- https://www.w3schools.com/charsets/ref_html_utf8.asp 
     THESE DO NOT MATCH ;
  if kindex(line, '03B2'x)
     then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 BETA char: ' lbtest=;
  if kindex(line, '03BA'x)
     then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 KAPPA char: ' lbtest=;
  if kindex(line, '2010'x)
     then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 HYPHEN char: ' lbtest=;
  if kindex(line, '2011'x)
     then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 NON-BREAKING HYPHEN char: ' lbtest=;

These searches do find UNEXPECTED transcoded/corrupted chars:

*--- https://www.w3schools.com/charsets/ref_html_utf8.asp 
     THESE DO MATCH, BUT SHOULD NOT - TRANSCODED CHARS, NOT THE ORIGINAL CHARS ;
  if kindex(line, 'E2'x)
     then putlog 'INFO: FOUND UNEXPECTED HTML5 UTF-8 â char: ' lbtest=;
  if kindex(line, '80'x)
     then putlog 'INFO: FOUND UNEXPECTED HTML5 UTF-8 € char: ' lbtest=;
  if kindex(line, '91'x)
     then putlog 'INFO: FOUND UNEXPECTED HTML5 UTF-8 ‘ char: ' lbtest=;

I am unable to resolve or make sense of this. It would be quite nice if this were somehow more transparent.
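For what it's worth, a sketch of how the same search looks once the session itself runs in UTF-8 (see the summary further down): the hex literal then holds the UTF-8 byte sequence of the character rather than its code point, e.g. 'CEB2'x for β (U+03B2), 'CEBA'x for κ (U+03BA) and 'E28091'x for the non-breaking hyphen (U+2011).

*--- Assumes a UTF-8 session (Unicode Support config) and the READ fileref above ;
data _null_;
  infile READ length=len lrecl=32767;
  input line $varying32767. len;

  if kindex(line, 'CEB2'x)   then putlog 'INFO: found BETA (U+03B2) in: '                line $80.;
  if kindex(line, 'CEBA'x)   then putlog 'INFO: found KAPPA (U+03BA) in: '               line $80.;
  if kindex(line, 'E28091'x) then putlog 'INFO: found NON-BREAKING HYPHEN (U+2011) in: ' line $80.;
run;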

GGO
Obsidian | Level 7

And just to add that pointing to the alternative UTF-8 config file (-ENCODING UTF-8) does not help.

 

That was the guidance here: http://support.sas.com/kb/51/586.html

GGO
Obsidian | Level 7

Well, I'm stubborn, and I don't like to give up, so in case it helps anyone else, here is my summary of successfully working with UTF-8 characters.

 

Much of what ChrisNZ suggested is very helpful, and contributed to finally sorting this out.

 

Steps to check when working with UTF-8 in SAS 9.4 on Windows Server 2016 / Win10:

  1. Make sure your SAS session is set for UTF-8 encoding. This should be set in your session config, or command-line string (shortcut target) via the -CONFIG system option.
  2. The correct config file is typically: -CONFIG "C:\Program Files\SASHome\SASFoundation\9.4\nls\u8\sasv9.cfg"
  3. SAS 9.4 installs an application menu shortcut "SAS 9.4 (Unicode Support)" that should point you to that config file.
  4. Check that the config file sets: -ENCODING UTF-8
  5. Get to know that setting well, since you can force read/write encoding on libname and filename statements to prevent unwanted transcoding of UTF-8 chars
  6. LIBNAME options: SAS 9.4 provides both INENCODING and OUTENCODING settings
  7. FILENAME options: see the SAS 9.4 FILENAME statement documentation, which covers the ENCODING= option used above.
  8. Get to know your encoding options - SAS 9.4 Encoding Values in SAS Language Elements
  9. Get to know SBCS, DBCS, MBCS and the SAS 9.4 Internationalization Compatibility for SAS String Functions
  10. Note: the PRX functions (which I started with at the top) only support SBCS, so find another approach when working with DBCS or MBCS data
  11. But also note that if you are certain that you are working with a section of the source that does not contain any DB or MB chars, then you may just get away with using the SB-only functions 🤞 🙂
  12. Note: some K-functions are buggy (Sorry, SAS - I'll back that up with example code, below)
  13. That Internationalization page gives examples of how to search for characters using hex literals in a UTF-8 session - syntax like "<hex-code>"x
  14. You'll have to understand Character Constants Expressed in Hexadecimal Notation
  15. The final piece is actually knowing the hex code for the characters of interest. The best reference that I can find is: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=0&number=1024 ... but it is hard to search, since there are lots of UTF-8 chars 🙂 (a small sketch of this follows the list)
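A small illustration of items 13-15 (a sketch, assuming a UTF-8 session): paste the character of interest into a string literal and dump it with the $HEX format to get its UTF-8 byte sequence; that sequence is what goes into the hex literal you search with.

data _null_;
  length c $4;                     /* room for up to 4 UTF-8 bytes           */
  do c = 'β', 'κ', '‑';            /* beta, kappa, non-breaking hyphen       */
    putlog c= c $hex8.;            /* CEB22020, CEBA2020, E2809120           */
  end;                             /* (trailing '20'x bytes are pad blanks)  */
run;

So β is 'CEB2'x, κ is 'CEBA'x, the non-breaking hyphen is 'E28091'x, and, for example, kindex(line, 'CEBA'x) then finds the κ.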

Good luck 😉

ChrisNZ
Tourmaline | Level 20

Excellent summary.

 

About: "4. Check that the config file sets: -ENCODING UTF-8"

You can verify what the running session actually picked up by submitting: proc options group=languagecontrol; run;
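A couple of narrower spot checks that report the same thing (a sketch; both rely only on standard system options):

proc options option=encoding;
run;

%put Session encoding is %sysfunc(getoption(ENCODING));

In a session started with the Unicode Support configuration, these report UTF-8.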

GGO
Obsidian | Level 7

As promised above, here is an example of buggy, or at least unreliable, K-functions.

 

KCOMPRESS() should have an interface similar to COMPRESS(), including modifiers.

 

Unfortunately, it does not actually accept modifiers:

 

 

data _null_;
  length str noletters nonumbers $50;
  do str = 'label123', 'α1β2γ3δ', 'ceb131ceb232ceb333ceb434'x;
    putlog 'ORIGINAL: '     str= str=$hex24. ;
    nonumbers = strip(kcompress(str, ,'d'));
    putlog ' No numbers: ' nonumbers= str=$hex24. ;
    noletters = strip(kcompress(str, ,'dk'));
    putlog ' No letters: ' noletters= str=$hex24. ;
  end;
run;

This throws errors:

 

1      nonumbers = strip(kcompress(str, ,'d'));
                         ---------
                         72
ERROR 72-185: The KCOMPRESS function call has too many arguments.

Brute force still works:

data _null_;
  length str noletters nonumbers $50;
  do str = 'label123', 'α1β2γ3δ', 'ceb131ceb232ceb333ceb434'x;
    putlog 'ORIGINAL: '     str= str=$hex24. ;
    nonumbers = strip(kcompress(str, '0123456789'));
    putlog ' No numbers: ' nonumbers= str=$hex24. ;
    noletters = strip(kcompress(str, 'abcdefghijklmnopqrstuvwxyz'));
    putlog ' No letters: ' noletters= str=$hex24. ;
  end;
run;

Although even in a UTF-8 session, the log cannot display the UTF-8 characters, which is rather unhelpful.

ORIGINAL: str=label123 str=6C6162656C31323320202020
 No numbers: nonumbers=label str=6C6162656C31323320202020
 No letters: noletters=123 str=6C6162656C31323320202020
ORIGINAL: str=α1β2γ3δ str=CEB131CEB232CEB333CEB420
 No numbers: nonumbers=αβγδ str=CEB131CEB232CEB333CEB420
 No letters: noletters=α1β2γ3δ str=CEB131CEB232CEB333CEB420
ORIGINAL: str=α1β2γ3δ4 str=CEB131CEB232CEB333CEB434
 No numbers: nonumbers=αβγδ str=CEB131CEB232CEB333CEB434
 No letters: noletters=α1β2γ3δ4 str=CEB131CEB232CEB333CEB434
ChrisNZ
Tourmaline | Level 20

Nothing buggy or unreliable about the K-functions.

They support fewer features, that's all.

And it's easy to understand why.

For example, what's a number or a letter in UTF-8?

Is δ a letter? What about consonant clusters used in Korean?

Is ٣ a number?

What about Hebrew numbers? Fifteen can be ט״ו or י״ה, while the second of these 3 characters is not a number.

It gets really complicated really quickly when you want to support all the writing systems in the world.

This explains why, for now, some options are left out for the multi-byte functions.

 

 
