GGO
Obsidian | Level 7

I've searched the communities for solutions to this, or at least explanations, but found none.

Any insights would be helpful.

 

Here is example code, to illustrate the problems.

 

Objective: read a UTF-8 encoded page, parse it into SAS variables, and store it as a SAS7BDAT ... without corrupting UTF-8 chars.

 

Example code.

%let w_encoding = UTF-8;
%let r_encoding = UTF-8;

filename WRITE './unitslab_conversions.txt' encoding="&W_ENCODING";
filename  READ './unitslab_conversions.txt' encoding="&R_ENCODING";

libname data './data' inencoding="&R_ENCODING" outencoding="&W_ENCODING";

proc http
  method="GET"
  url="https://unitslab.com/"
  out=WRITE ;
run;

data data.lbtests;
  infile READ length=len lrecl=32767;
  input line $varying32767. len;

  line = strip(line);

  if prxmatch('/^<li>.+\/node/i', line);
  if prxmatch('/(microglobulin|cancer|beta|kappa|mass|mullerian)/i', line);

  lbtest = strip(scan(line,5,'<>'));
  node   = strip(scan(line,2,'"'));

*--- Paste in strings from the UTF-8 page 
     NB - NONE OF THESE MATCH ;
  if    index(lbtest, 'anti-Mullerian')
     or index(lbtest, 'Beta 2-microglobulin (ß2-M)')
     or index(lbtest, 'Free ß-subunit of human chorionic gonadotropin (free ßhCG)')
     or index(lbtest, 'Kappa (κ)')
     then putlog 'INFO: FOUND EXPECTED string match: ' lbtest=;

*--- https://www.w3schools.com/charsets/ref_html_utf8.asp 
     NB - LIMITED CHAR RANGE ALSO MATCH TRANSCODE (CORRUPTED) chars above \x{2000} ;
  if prxmatch('/[\x{03b2}\x{052f}]/', line)
     then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 char: ' lbtest=;

  keep lbtest node;
run;

 

Problems with results:

 

1 - SAS log. Notice the corrupted UTF-8 chars (some of the UTF-8 hyphens and Greek letters):

 

lbtest=alpha-1‑microglobulin node=/node/89
lbtest=anti‐Mullerian hormone (AMH) node=/node/155
lbtest=beta - CrossLaps - Degradation products of type I collagen node=/node/164
lbtest=Beta 2‐microglobulin (β2‐M) node=/node/145
lbtest=beta-Hydroxybutyric acid node=/node/225
lbtest=CA 125 (Cancer Antigen 125) node=/node/104
lbtest=CA 15-3 (Cancer Antigen 15-3) node=/node/105
lbtest=CA 72-4 (Antigène de cancer 72-4) node=/node/107
lbtest=CK‐MB mass - the MB isoenzyme of creatine kinase (quantitative determination) node=/node/157
lbtest=Kappa (κ) light chain node=/node/150

 

2 - Similar in the resulting SAS7BDAT, despite the encoding being UTF-8, but with different corruption that appears to eat up more character bytes:

 

sas7bdat.PNG

 

 

3 - Similar result when viewing the SAS7BDAT outside a SAS session (SAS Universal Viewer), but again with different corruption of the UTF-8 chars:

 

uv_sas7bdat.PNG

 

Thanks for any insights, tips or fixes to this code that can avoid this UTF-8 character corruption.

GG

ACCEPTED SOLUTION
ChrisNZ
Tourmaline | Level 20

See here to start your session:

http://support.sas.com/kb/51/586.html

 

Note that UTF-8 is not DBCS (double-byte), it is MBCS (multi-byte).


22 Replies
ChrisNZ
Tourmaline | Level 20

I might be wrong but I don't think what you are showing means there is corruption.

It might be that the data is fine but the viewers don't support UTF-8.

What's the encoding of your SAS environment?

Can you see the txt files correctly in a proper viewer such as Notepad++?

 

GGO
Obsidian | Level 7

Thanks, Chris - Unfortunately, it is not only a display problem.

 

Bottom line: SAS corrupts characters in the HTML5 UTF-8 range above the Greek/Cyrillic chars.

 

Reference for char hex values: w3schools page HTML UTF-8 encoding.

 

Notepad++ preserves but fails to display HTML UTF-8 chars in the range \x{2000}-\x{27bf}

  • According to the W3 HTML UTF-8 page, above, these are all chars after "Cyrillic Supplement"
  • But at least Notepad++ preserves the correct chars, despite display problems - regex [\x{2000}-\x{27bf}]
  • npp-utf-8-display-BAD.png

 

By comparison, SAS 9.04 corrupts chars at least above \x{052F}.

I've updated the code above to demonstrate this.

 

The resulting log file snippet follows. SAS and Notepad++ have similar display problems:

 

NOTE: The infile READ is:
Filename=unitslab_conversions.txt,
RECFM=V,LRECL=131068,File Size (bytes)=42931,
Last Modified=23Feb2020:10:51:35,
Create Time=21Feb2020:08:29:19

INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=alpha-1‑microglobulin
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=anti‐Mullerian hormone (AMH)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=beta - CrossLaps - Degradation products of type I collagen
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=Beta 2‐microglobulin (β2‐M)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=beta-Hydroxybutyric acid
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=CA 125 (Cancer Antigen 125)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=CA 15-3 (Cancer Antigen 15-3)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=CA 72-4 (Antigène de cancer 72-4)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=CK‐MB mass - the MB isoenzyme of creatine kinase (quantitative determination)
INFO: FOUND EXPECTED HTML5 UTF-8 char: lbtest=Kappa (κ) light chain

 

Note that the prxmatch not only matches on the Greek letters (small beta and kappa) but also on the hyphens \x{2010} and \x{2011}.

 

SAS has corrupted these UTF-8 chars into something in the Greek/Cyrillic range.

ChrisNZ
Tourmaline | Level 20

You are raising a few different issues here:

 

1. Matching strings

UTF-8 is I18N Level 2, and not all SAS functions support this (a small illustration follows at the end of this reply). See here (it's a Viya link but it is true for SAS too):

https://documentation.sas.com/?docsetId=nlsref&docsetTarget=p1pca7vwjjwucin178l8qddjn0gi.htm&docsetV...

 

2. Displaying strings.

You haven't said what encoding your SAS session uses.

 

3. How do the contents of filerefs READ and WRITE compare in terms of encoded values?
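To make the I18N point concrete, here is a minimal sketch (assuming a UTF-8 session, with the Greek letter pasted as UTF-8): the byte-oriented functions count bytes, while the K-functions count characters.

data _null_;
  s = 'β2-M';              /* the β is 2 bytes (CE B2) in UTF-8        */
  blen = length(s);        /* byte length: 5                           */
  clen = klength(s);       /* character length: 4                      */
  bpos = index(s, '-');    /* byte position of the dash: 4             */
  cpos = kindex(s, '-');   /* character position of the dash: 3        */
  putlog blen= clen= bpos= cpos=;
run;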

GGO
Obsidian | Level 7

Thanks for that reference, Chris - I'll have to review (Internationalization).

 

I suspect that this will not explain why the approach in my code works for some UTF-8 range beyond ASCII, but not the entire HTML5 UTF-8 range.

 

Your point (3) is also a clever test - I'll take a look at this, as well.

 

I've set my session encoding at start-up: -encoding ASCIIANY.

 

I tried forcing the (SBCS) session to -encoding "UTF-8", which according to the documentation should be valid, although note that UTF-8 is at best an afterthought on that page (same for UNIX). But SAS fails to launch with that setting ("invalid" encoding value).

 

I tried various combinations of DBCS, ENCODING, DBCSTYPE, DBCSLANG - all result in some sort of failure message or a failure to start the session, so no further testing in this direction is possible in my work environment.

 

Much appreciated!

ChrisNZ
Tourmaline | Level 20

See here to start your session:

http://support.sas.com/kb/51/586.html

 

Note that UTF-8 is not DBCS (double-byte), it is MBCS (multi-byte).
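For reference, the KB note boils down to starting SAS with the Unicode configuration file rather than the default one. On a typical Windows install, the "SAS 9.4 (Unicode Support)" shortcut target looks roughly like this (the SASHome path varies by installation, so treat it as a sketch):

"C:\Program Files\SASHome\SASFoundation\9.4\sas.exe" -CONFIG "C:\Program Files\SASHome\SASFoundation\9.4\nls\u8\sasv9.cfg"

That config file sets -ENCODING UTF-8 for the whole session.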

GGO
Obsidian | Level 7

Again, Chris: Much appreciated, all of your nuggets of insight!

GGO
Obsidian | Level 7

Unfortunately I have not been able to fully resolve how SAS is handling HTML5 UTF-8 chars.

 

Step by step - this is what works, and this is what I cannot get past in a session with -ENCODING ASCIIANY:

 

1 - As Chris suggested, SAS is preserving HTML UTF-8 chars in the read/write process.

In the following snippet (a variation on the code above), the files "unitslab_conversions.txt" and "unitslab_conv2.txt" match exactly (a byte-level check is sketched after the data step).

Note: TERMSTR=LF overrides the default behaviour in my Windows environment, to preserve the UNIX line feeds (LF).

 

%let w_encoding = UTF-8;
%let r_encoding = UTF-8;

filename WRITE  './unitslab_conversions.txt' encoding="&W_ENCODING";
filename  READ  './unitslab_conversions.txt' encoding="&R_ENCODING";
filename WRITE2 './unitslab_conv2.txt'       encoding="&W_ENCODING" TERMSTR=LF ;

libname data './data' inencoding="&R_ENCODING" outencoding="&W_ENCODING";

proc http
  method = 'GET'
     url = 'https://unitslab.com/'
     out = WRITE ;
run;

data _null_;
  infile READ length=len lrecl=32767;
  input;

  file WRITE2 lrecl=32767; 
  put _infile_;
run;
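One way to confirm the two files really are byte-identical (a sketch, assuming a Windows session with XCMD allowed and the files sitting in the SAS working directory; fc /B is the Windows binary file-compare command):

filename CMP pipe 'fc /B unitslab_conversions.txt unitslab_conv2.txt';

data _null_;
  infile CMP;
  input;
  putlog _infile_;   /* "FC: no differences encountered" when the files are byte-identical */
run;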

2 - However, within the string-searching / matching data step above, even the K-functions do not correctly find HTML5 UTF-8 chars in those text files. They DO find what seem to be corrupted chars. E.g., instead of the HTML UTF-8 non-breaking hyphen \x{2011}, the K-functions find the byte string 'E28091'x, which matches the corrupted display chars "‑" (see the sketch at the end of this post).

 

These searches find none of the expected characters:

*--- https://www.w3schools.com/charsets/ref_html_utf8.asp 
     THESE DO NOT MATCH ;
  if kindex(line, '03B2'x)
     then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 BETA char: ' lbtest=;
  if kindex(line, '03BA'x)
     then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 KAPPA char: ' lbtest=;
  if kindex(line, '2010'x)
     then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 HYPHEN char: ' lbtest=;
  if kindex(line, '2011'x)
     then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 NON-BREAKING HYPHEN char: ' lbtest=;

These searches do find UNEXPECTED transcoded/corrupted chars:

*--- https://www.w3schools.com/charsets/ref_html_utf8.asp 
     THESE DO MATCH, BUT SHOULD NOT - TRANSCODED CHARS, NOT THE ORIGINAL CHARS ;
  if kindex(line, 'E2'x)
     then putlog 'INFO: FOUND UNEXPECTED HTML5 UTF-8 â char: ' lbtest=;
  if kindex(line, '80'x)
     then putlog 'INFO: FOUND UNEXPECTED HTML5 UTF-8 € char: ' lbtest=;
  if kindex(line, '91'x)
     then putlog 'INFO: FOUND UNEXPECTED HTML5 UTF-8 ‘ char: ' lbtest=;

I am unable to resolve or make sense of this. It would be quite nice if this were somehow more transparent.
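For what it's worth, a sketch of how the same search looks once the session itself runs in UTF-8 (see the summary further down): the hex literal then holds the UTF-8 byte sequence of the character rather than its code point, e.g. 'CEB2'x for β (U+03B2), 'CEBA'x for κ (U+03BA) and 'E28091'x for the non-breaking hyphen (U+2011).

*--- Assumes a UTF-8 session (Unicode Support config) and the READ fileref above ;
data _null_;
  infile READ length=len lrecl=32767;
  input line $varying32767. len;

  if kindex(line, 'CEB2'x)   then putlog 'INFO: found BETA (U+03B2) in: '                line $80.;
  if kindex(line, 'CEBA'x)   then putlog 'INFO: found KAPPA (U+03BA) in: '               line $80.;
  if kindex(line, 'E28091'x) then putlog 'INFO: found NON-BREAKING HYPHEN (U+2011) in: ' line $80.;
run;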

GGO
Obsidian | Level 7

And just to add that pointing to the alternative UTF-8 config file (-ENCODING UTF-8) does not help.

 

That was the guidance here: http://support.sas.com/kb/51/586.html

GGO
Obsidian | Level 7

Well, I'm stubborn, and I don't like to give up, so in case it helps anyone else, here is my summary of successfully working with UTF-8 characters.

 

Much of what ChrisNZ suggested is very helpful, and contributed to finally sorting this out.

 

Steps to check when working with UTF-8 in SAS 9.4 on Windows Server 2016 / Win10:

  1. Make sure your SAS session is set for UTF-8 encoding. This should be set in your session config, or command-line string (shortcut target) via the -CONFIG system option.
  2. The correct config file is typically: -CONFIG "C:\Program Files\SASHome\SASFoundation\9.4\nls\u8\sasv9.cfg"
  3. SAS 9.4 installs an application menu shortcut "SAS 9.4 (Unicode Support)" that should point you to that config file.
  4. Check that the config file sets: -ENCODING UTF-8
  5. Get to know that setting well, since you can force read/write encoding on libname and filename statements to prevent unwanted transcoding of UTF-8 chars
  6. LIBNAME options: SAS 9.4 provides both INENCODING and OUTENCODING settings
  7. FILENAME options: see the SAS 9.4 FILENAME statement documentation, which covers the ENCODING= option used above.
  8. Get to know your encoding options - SAS 9.4 Encoding Values in SAS Language Elements
  9. Get to know SBCS, DBCS, MBCS and the SAS 9.4 Internationalization Compatibility for SAS String Functions
  10. Note: the PRX functions (which I started with at the top) only support SBCS, so find another approach when working with DBCS or MBCS data
  11. But also note that if you are certain that you are working with a section of the source that does not contain any DB or MB chars, then you may just get away with using the SB-only functions 🤞 🙂
  12. Note: some K-functions are buggy (Sorry, SAS - I'll back that up with example code, below)
  13. That Internationalization page gives examples of how to search for characters using hex literals in a UTF-8 session - syntax like "<hex-code>"x
  14. You'll have to understand Character Constants Expressed in Hexadecimal Notation
  15. The final piece is actually knowing the hex code for the characters of interest. The best reference that I can find is: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=0&number=1024 ... but it is hard to search, since there are lots of UTF-8 chars 🙂 (a small sketch of this follows the list)
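A small illustration of items 13-15 (a sketch, assuming a UTF-8 session): paste the character of interest into a string literal and dump it with the $HEX format to get its UTF-8 byte sequence; that sequence is what goes into the hex literal you search with.

data _null_;
  length c $4;                     /* room for up to 4 UTF-8 bytes           */
  do c = 'β', 'κ', '‑';            /* beta, kappa, non-breaking hyphen       */
    putlog c= c $hex8.;            /* CEB22020, CEBA2020, E2809120           */
  end;                             /* (trailing '20'x bytes are pad blanks)  */
run;

So β is 'CEB2'x, κ is 'CEBA'x, the non-breaking hyphen is 'E28091'x, and, for example, kindex(line, 'CEBA'x) then finds the κ.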

Good luck 😉

ChrisNZ
Tourmaline | Level 20

Excellent summary.

 

About: "4. Check that the config file sets: -ENCODING UTF-8"

You can verify what the running session actually picked up by submitting: proc options group=languagecontrol; run;
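A couple of narrower spot checks that report the same thing (a sketch; both rely only on standard system options):

proc options option=encoding;
run;

%put Session encoding is %sysfunc(getoption(ENCODING));

In a session started with the Unicode Support configuration, these report UTF-8.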

GGO
Obsidian | Level 7

As promised above, here is an example of buggy, or at least unreliable, K-functions.

 

KCOMPRESS() should have an interface similar to COMPRESS(), including modifiers.

 

Unfortunately, it does not actually accept modifiers:

 

 

data _null_;
  length str noletters nonumbers $50;
  do str = 'label123', 'α1β2γ3δ', 'ceb131ceb232ceb333ceb434'x;
    putlog 'ORIGINAL: '     str= str=$hex24. ;
    nonumbers = strip(kcompress(str, ,'d'));
    putlog ' No numbers: ' nonumbers= str=$hex24. ;
    noletters = strip(kcompress(str, ,'dk'));
    putlog ' No letters: ' noletters= str=$hex24. ;
  end;
run;

This throws errors:

 

1      nonumbers = strip(kcompress(str, ,'d'));
                         ---------
                         72
ERROR 72-185: The KCOMPRESS function call has too many arguments.

Brute force still works:

data _null_;
  length str noletters nonumbers $50;
  do str = 'label123', 'α1β2γ3δ', 'ceb131ceb232ceb333ceb434'x;
    putlog 'ORIGINAL: '     str= str=$hex24. ;
    nonumbers = strip(kcompress(str, '0123456789'));
    putlog ' No numbers: ' nonumbers= str=$hex24. ;
    noletters = strip(kcompress(str, 'abcdefghijklmnopqrstuvwxyz'));
    putlog ' No letters: ' noletters= str=$hex24. ;
  end;
run;

Although even in a UTF-8 session, the log cannot display the UTF-8 characters, which is rather unhelpful.

ORIGINAL: str=label123 str=6C6162656C31323320202020
 No numbers: nonumbers=label str=6C6162656C31323320202020
 No letters: noletters=123 str=6C6162656C31323320202020
ORIGINAL: str=α1β2γ3δ str=CEB131CEB232CEB333CEB420
 No numbers: nonumbers=αβγδ str=CEB131CEB232CEB333CEB420
 No letters: noletters=α1β2γ3δ str=CEB131CEB232CEB333CEB420
ORIGINAL: str=α1β2γ3δ4 str=CEB131CEB232CEB333CEB434
 No numbers: nonumbers=αβγδ str=CEB131CEB232CEB333CEB434
 No letters: noletters=α1β2γ3δ4 str=CEB131CEB232CEB333CEB434
ChrisNZ
Tourmaline | Level 20

Nothing buggy or unreliable about the K-functions.

They support fewer features, that's all.

And it's easy to understand why.

For example, what's a number or a letter in UTF-8?

Is δ a letter? What about consonant clusters used in Korean?

Is ٣ a number?

What about Hebrew numbers? Fifteen can be ט״ו or י״ה, while the second of these 3 characters is not a number.

It gets really complicated really quickly when you want to support all the writing systems in the world.

This explains why, for now, some options are left out for the multi-byte functions.

 

 
