I've searched the communities for solutions to this, or at least explanations, but found none. Any insights would be helpful. Here is example code to illustrate the problems.

Objective: Read a UTF-8 encoded page, parse it into SAS variables, and store the result as a sas7bdat, without corrupting UTF-8 chars.

Example code:

%let w_encoding = UTF-8;
%let r_encoding = UTF-8;
filename WRITE './unitslab_conversions.txt' encoding="&W_ENCODING";
filename READ './unitslab_conversions.txt' encoding="&R_ENCODING";
libname data './data' inencoding="&R_ENCODING" outencoding="&W_ENCODING";
proc http
method="GET"
url="https://unitslab.com/"
out=WRITE ;
run;
data data.lbtests;
infile READ length=len lrecl=32767;
input line $varying32767. len;
line = strip(line);
if prxmatch('/^<li>.+\/node/i', line);
if prxmatch('/(microglobulin|cancer|beta|kappa|mass|mullerian)/i', line);
lbtest = strip(scan(line,5,'<>'));
node = strip(scan(line,2,'"'));
*--- Paste in strings from the UTF-8 page
NB - NONE OF THESE MATCH ;
if index(lbtest, 'anti-Mullerian')
or index(lbtest, 'Beta 2-microglobulin (ß2-M)')
or index(lbtest, 'Free ß-subunit of human chorionic gonadotropin (free ßhCG)')
or index(lbtest, 'Kappa (κ)')
then putlog 'INFO: FOUND EXPECTED string match: ' lbtest=;
*--- https://www.w3schools.com/charsets/ref_html_utf8.asp
NB - LIMITED CHAR RANGE ALSO MATCH TRANSCODE (CORRUPTED) chars above \x{2000} ;
if prxmatch('/[\x{03b2}\x{052f}]/', line)
then putlog 'INFO: FOUND EXPECTED HTML5 UTF-8 char: ' lbtest=;
keep lbtest node;
run;

Problems with results:

1 - SAS log. Notice the corrupted UTF-8 chars, especially some of the UTF-8 hyphens and Greek chars:

lbtest=alpha-1‑microglobulin node=/node/89
lbtest=antiâ€Mullerian hormone (AMH) node=/node/155
lbtest=beta - CrossLaps - Degradation products of type I collagen node=/node/164
lbtest=Beta 2â€microglobulin (β2â€M) node=/node/145
lbtest=beta-Hydroxybutyric acid node=/node/225
lbtest=CA 125 (Cancer Antigen 125) node=/node/104
lbtest=CA 15-3 (Cancer Antigen 15-3) node=/node/105
lbtest=CA 72-4 (Antigène de cancer 72-4) node=/node/107
lbtest=CKâ€MB mass - the MB isoenzyme of creatine kinase (quantitative determination) node=/node/157
lbtest=Kappa (κ) light chain node=/node/150

2 - The resulting SAS7BDAT shows a similar problem, despite its encoding being UTF-8, but with different corruption that appears to eat up more char bytes.

3 - Viewing the SAS7BDAT outside a SAS session, in Universal Viewer, gives a similar result, but with yet another corruption of the UTF-8 chars.

Thanks for any insights, tips or fixes to this code that can avoid this UTF-8 character corruption.

GG
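
A minimal diagnostic sketch that might help narrow this down. It assumes the program above has already run, so that data.lbtests exists. It is not part of the original program. It prints the encoding the SAS session itself is running in, the raw bytes of the first few stored values, and the encoding recorded in the data set header. If the bytes are intact UTF-8, a Greek beta should appear as CEB2 and a non-breaking hyphen (U+2011) as E28091 in the hex dump. In that case the corruption would come from how the values are displayed rather than how they are stored.

```sas
/* Assumption: diagnostic only, run after the example program above. */

/* 1. Encoding of the running SAS session (set at startup). */
%put NOTE: Session encoding is &SYSENCODING;
proc options option=encoding;
run;

/* 2. Raw bytes of the first few stored values.              */
/*    Intact UTF-8: beta = CEB2, kappa = CEBA, U+2011 = E28091. */
data _null_;
  set data.lbtests (obs=10);
  put lbtest= / lbtest= $hex80.;
run;

/* 3. Encoding recorded in the data set header. */
proc contents data=data.lbtests;
run;
```

If the session encoding reported in step 1 is something like wlatin1 rather than utf-8, that alone could explain the log output. As far as I know, the ENCODING system option can only be set when SAS starts, not from inside a running session.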