I've been puzzling over this post (Case-statement-trims-length-to-200) by @ChrisNZ - and the weirdness continues.
Before I found Chris's extensive analysis of what was going on with his problem, I managed to get a Perl expression to work at more than 200 bytes. But I had to jump through hoops to do it. And I still don't know quite what's going on.
A few of our source files have 32k-byte HTML text in them, which requires decoding and tag stripping. Not only that, but the often contain tabs, carriage returns and the like - all of which are a barrier for textual analysis.
My original code was
strip(compbl(translate(prxchange("s/<.*?>/ /", -1, htmldecode(htmldecode(message))), ' ', '08090a0b0c0d'x))) as message length=32767
It was unwieldy, but it worked. (htmldecode is called twice, because sometimes that's the only way of getting rid of the special characters - I don't control the source data!) And then I looked at the output and saw that no value was longer than 200 bytes - often less because of utf-8.
But cutting out one function at a time, I found that prxchange was working as expected. So I created an extra work column and added one more function at a time, and it still worked.
Here's what I've ended up with:
prxchange("s/<.*?>/ /", -1, htmldecode(htmldecode(message))) as untagged_message length=32767,
strip(compbl(translate(calculated untagged_message, ' ', '08090a0b0c0d'x))) as message length=32767 /* six spaces between the quotes */
I have no idea why splitting it into two expressions and columns works. Does anybody have any suggestions?
Yes, this whole thing is most messy for no good reason at all, and needs (needed? maybe it's too late) a good clean up.
The data step behaves differently, and results are as expected. The SQL interpreter is the confused party here.
data T1;
length X Y1 Z1 $32000;
X=repeat('1234<zxcv> '||'090A'x,200);
Y1= prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X))) ;
Z1=compbl(prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X))));
run;
proc sql;
create table T2 as
select X
, prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X))) as Y2 length=32000
, compbl(prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X)))) as Z2 length=32000
from T1;
quit;
data ALL;
merge T1 T2;
LY1=length(Y1);
LY2=length(Y2);
LZ1=length(Z1);
LZ2=length(Z2);
putlog LY1= LY2= LZ1= LZ2= ;
run;
LY1=1604 LY2=1604 LZ1=1004 LZ2=124
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.