BookmarkSubscribeRSS Feed
LaurieF
Barite | Level 11

I've been puzzling over this post (Case-statement-trims-length-to-200) by @ChrisNZ - and the weirdness continues.

 

Before I found Chris's extensive analysis of what was going on with his problem, I managed to get a Perl expression to work at more than 200 bytes. But I had to jump through hoops to do it. And I still don't know quite what's going on.

 

A few of our source files have 32k-byte HTML text in them, which requires decoding and tag stripping. Not only that, but the often contain tabs, carriage returns and the like - all of which are a barrier for textual analysis.

 

My original code was

strip(compbl(translate(prxchange("s/<.*?>/ /", -1, htmldecode(htmldecode(message))), '     ', '08090a0b0c0d'x))) as message length=32767

It was unwieldy, but it worked. (htmldecode is called twice, because sometimes that's the only way of getting rid of the special characters - I don't control the source data!) And then I looked at the output and saw that no value was longer than 200 bytes - often less because of utf-8.

 

But cutting out one function at a time,  I found that prxchange was working as expected. So I created an extra work column and added one more function at a time, and it still worked.

 

Here's what I've ended up with:

prxchange("s/<.*?>/ /", -1, htmldecode(htmldecode(message))) as untagged_message length=32767,
strip(compbl(translate(calculated untagged_message, '     ', '08090a0b0c0d'x))) as message length=32767 /* six spaces between the quotes */

 I have no idea why splitting it into two expressions and columns works. Does anybody have any suggestions?

1 REPLY 1
ChrisNZ
Tourmaline | Level 20

Yes, this whole thing is most messy for no good reason at all, and needs (needed? maybe it's too late) a good clean up.

The data step behaves differently, and results are as expected. The SQL interpreter is the confused party here.

data T1;
 length X Y1 Z1 $32000;
 X=repeat('1234<zxcv> '||'090A'x,200);
 Y1=       prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X))) ;
 Z1=compbl(prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X))));
run;
proc sql;
  create table T2 as
  select X
        ,        prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X)))  as Y2 length=32000
        , compbl(prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X)))) as Z2 length=32000
  from T1;
quit;
data ALL;
  merge T1 T2;
  LY1=length(Y1);
  LY2=length(Y2);
  LZ1=length(Z1);
  LZ2=length(Z2);
  putlog LY1= LY2= LZ1= LZ2= ;
run;

LY1=1604 LY2=1604 LZ1=1004 LZ2=124

 

 

 

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 1 reply
  • 586 views
  • 1 like
  • 2 in conversation