BookmarkSubscribeRSS Feed
LaurieF
Barite | Level 11

I've been puzzling over this post (Case-statement-trims-length-to-200) by @ChrisNZ - and the weirdness continues.

 

Before I found Chris's extensive analysis of what was going on with his problem, I managed to get a Perl expression to work at more than 200 bytes. But I had to jump through hoops to do it. And I still don't know quite what's going on.

 

A few of our source files have 32k-byte HTML text in them, which requires decoding and tag stripping. Not only that, but the often contain tabs, carriage returns and the like - all of which are a barrier for textual analysis.

 

My original code was

strip(compbl(translate(prxchange("s/<.*?>/ /", -1, htmldecode(htmldecode(message))), '     ', '08090a0b0c0d'x))) as message length=32767

It was unwieldy, but it worked. (htmldecode is called twice, because sometimes that's the only way of getting rid of the special characters - I don't control the source data!) And then I looked at the output and saw that no value was longer than 200 bytes - often less because of utf-8.

 

But cutting out one function at a time,  I found that prxchange was working as expected. So I created an extra work column and added one more function at a time, and it still worked.

 

Here's what I've ended up with:

prxchange("s/<.*?>/ /", -1, htmldecode(htmldecode(message))) as untagged_message length=32767,
strip(compbl(translate(calculated untagged_message, '     ', '08090a0b0c0d'x))) as message length=32767 /* six spaces between the quotes */

 I have no idea why splitting it into two expressions and columns works. Does anybody have any suggestions?

1 REPLY 1
ChrisNZ
Tourmaline | Level 20

Yes, this whole thing is most messy for no good reason at all, and needs (needed? maybe it's too late) a good clean up.

The data step behaves differently, and results are as expected. The SQL interpreter is the confused party here.

data T1;
 length X Y1 Z1 $32000;
 X=repeat('1234<zxcv> '||'090A'x,200);
 Y1=       prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X))) ;
 Z1=compbl(prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X))));
run;
proc sql;
  create table T2 as
  select X
        ,        prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X)))  as Y2 length=32000
        , compbl(prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X)))) as Z2 length=32000
  from T1;
quit;
data ALL;
  merge T1 T2;
  LY1=length(Y1);
  LY2=length(Y2);
  LZ1=length(Z1);
  LZ2=length(Z2);
  putlog LY1= LY2= LZ1= LZ2= ;
run;

LY1=1604 LY2=1604 LZ1=1004 LZ2=124

 

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 1 reply
  • 1000 views
  • 1 like
  • 2 in conversation