BookmarkSubscribeRSS Feed
LaurieF
Barite | Level 11

I've been puzzling over this post (Case-statement-trims-length-to-200) by @ChrisNZ - and the weirdness continues.

 

Before I found Chris's extensive analysis of what was going on with his problem, I managed to get a Perl expression to work at more than 200 bytes. But I had to jump through hoops to do it. And I still don't know quite what's going on.

 

A few of our source files have 32k-byte HTML text in them, which requires decoding and tag stripping. Not only that, but the often contain tabs, carriage returns and the like - all of which are a barrier for textual analysis.

 

My original code was

strip(compbl(translate(prxchange("s/<.*?>/ /", -1, htmldecode(htmldecode(message))), '     ', '08090a0b0c0d'x))) as message length=32767

It was unwieldy, but it worked. (htmldecode is called twice, because sometimes that's the only way of getting rid of the special characters - I don't control the source data!) And then I looked at the output and saw that no value was longer than 200 bytes - often less because of utf-8.

 

But cutting out one function at a time,  I found that prxchange was working as expected. So I created an extra work column and added one more function at a time, and it still worked.

 

Here's what I've ended up with:

prxchange("s/<.*?>/ /", -1, htmldecode(htmldecode(message))) as untagged_message length=32767,
strip(compbl(translate(calculated untagged_message, '     ', '08090a0b0c0d'x))) as message length=32767 /* six spaces between the quotes */

 I have no idea why splitting it into two expressions and columns works. Does anybody have any suggestions?

1 REPLY 1
ChrisNZ
Tourmaline | Level 20

Yes, this whole thing is most messy for no good reason at all, and needs (needed? maybe it's too late) a good clean up.

The data step behaves differently, and results are as expected. The SQL interpreter is the confused party here.

data T1;
 length X Y1 Z1 $32000;
 X=repeat('1234<zxcv> '||'090A'x,200);
 Y1=       prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X))) ;
 Z1=compbl(prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X))));
run;
proc sql;
  create table T2 as
  select X
        ,        prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X)))  as Y2 length=32000
        , compbl(prxchange("s/<.*?>|[\x08-\x0D]/ /", -1, htmldecode(htmldecode(X)))) as Z2 length=32000
  from T1;
quit;
data ALL;
  merge T1 T2;
  LY1=length(Y1);
  LY2=length(Y2);
  LZ1=length(Z1);
  LZ2=length(Z2);
  putlog LY1= LY2= LZ1= LZ2= ;
run;

LY1=1604 LY2=1604 LZ1=1004 LZ2=124

 

 

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 1 reply
  • 510 views
  • 1 like
  • 2 in conversation