DATA Step, Macro, Functions and more

Whitespace woes

Reply
Frequent Contributor
Posts: 107

Whitespace woes

[ Edited ]

From my understanding these three statements should be equivalent

 

 

CATX('', COL1, COL2, COL3)
TRIM(COL1) || TRIM(COL2) || TRIM(COL3)
PRXCHANGE('s/\s+//', -1, COL1) || PRXCHANGE('s/\s+//', -1, COL2) || PRXCHANGE('s/\s+//', -1, COL3)

 

COL1, COL2, and COL3 are characters.

 

I check for whitespace with:

 

PRXCHANGE('s/\s/X/', -1, CONCATENATED_COLUMN) 

 

1 and 3 produce equivalent columns but 2 does not.  All three produce trailing whitespace.  I would like to check this column for characters like:

 

CASE WHEN PRXMATCH('/^0+$/', CONCATENATED_COLUMN) THEN 1 ELSE 0 END AS FOO

But can't because of the whitespace that still appears.  Must I always account for whitespace whenever I do any operation like this?  Something like this:

 

 

CASE WHEN PRXMATCH('/^0+(?:\s*)$/', CONCATENATED_COLUMN) THEN 1 ELSE 0 END AS FOO

Or:

 

JOIN ON PRXCHANGE('s/\s+//', -1, TABLE_A.ID) = PRXCHANGE('s/\s+//', -1, TABLE_B.ID)

Because I can never be sure CATX or TRIM did their job? 

Super User
Posts: 24,004

Re: Whitespace woes

Posted in reply to tomcmacdonald

Replace TRIM with STRIP and what does that do?

Frequent Contributor
Posts: 107

Re: Whitespace woes

Same result as TRIM. 

SAS Super FREQ
Posts: 9,431

Re: Whitespace woes

[ Edited ]
Posted in reply to tomcmacdonald

Hi: I don't understand why you are even using CATX. That function assumes you want to introduce some character between every argument. Personally, if you want there NOT to be ANY whitespace or blank between the arguments, I would use CATT instead of CATX.

 

no_blank.png

Cynthia

Frequent Contributor
Posts: 107

Re: Whitespace woes

[ Edited ]
Posted in reply to Cynthia_sas

CATT eliminates only trailing whitespace according to:

http://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=n16y1vyi397p84n19dpvzw0pemsa.htm&...

CATX eliminates trailing and leading whitespace according to:

http://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=n0p7wxtk0hvn83n1pveisbcp2ae9.htm&...

That was my rationale for choosing CATX with a blank delimiter. 

 

My tentative solution to this is:

 

PRXCHANGE('s/\s+//', -1, (COL1 || COL2 || COL3))
Super User
Super User
Posts: 8,279

Re: Whitespace woes

Posted in reply to tomcmacdonald

You told CATX() to insert a space between the values.  If you just place two quotes next to each other then you are telling SAS you want a blank string of length one.  You can generate an empty string using the TRIMN() function.

 

Don't use a function, catx(), designed to insert something between the values when you don't actually want something inserted. Use CATT() or CATS().

 

Also what does \s mean in regular expressions?  Is it only spaces? or does it include other "white space"?

 

 

Frequent Contributor
Posts: 107

Re: Whitespace woes

A 0-length character does not exist in SAS?  '' actually means ' '? 

 

According to the PERL regular expression manual:

\s means the five characters [ \f\n\r\t]

 

https://perldoc.perl.org/perlre.html

 

Super User
Super User
Posts: 8,279

Re: Whitespace woes

Posted in reply to tomcmacdonald

For normal operations what would you do with a zero length character string?   If you assign it to an actual variable it will result in at least one space anyway.   SAS stores character variables as fixed length and fills (pads) with spaces.  It ignores trailing spaces when comparing strings.

 

If you want to eliminate trailing/leading characters other than spaces then you will need to use regex.

 

Or convert them to spaces and then use strip().

data _null_;
  length var1 var2 $10 ;
  var1='XYZ';
  var2=cat(' ','09'x,'XYZ','09'x,' ','A0'x);
  put (2*var1 2*var2) (=$quote. / =$hex./);
  if var1 ne var2 then put 'NOT EQUAL';
  if var1  = strip(translate(var2,' ','090A0D00A0'x)) then put 'EQUAL';
run;

 

 

 

 

 

 

 

Super User
Super User
Posts: 8,279

Re: Whitespace woes

Posted in reply to tomcmacdonald

The first 6 look like normal control characters. Not sure what the others are used for.

The set of characters that are deemed whitespace are those that Unicode calls "Pattern White Space", namely:

 

  1. U+0009 CHARACTER TABULATION
  2. U+000A LINE FEED
  3. U+000B LINE TABULATION
  4. U+000C FORM FEED
  5. U+000D CARRIAGE RETURN
  6. U+0020 SPACE
  7. U+0085 NEXT LINE
  8. U+200E LEFT-TO-RIGHT MARK
  9. U+200F RIGHT-TO-LEFT MARK
  10. U+2028 LINE SEPARATOR
  11. U+2029 PARAGRAPH SEPARATOR

 

 

Super User
Posts: 13,941

Re: Whitespace woes

Posted in reply to tomcmacdonald

It never hurts to provide a concrete example involving actual values and what the result is supposed to be at the end.

Showing what does not work does not really tell us what you want to accomplish.

 


@tomcmacdonald wrote:

From my understanding these three statements should be equivalent

 

 


CATX('', COL1, COL2, COL3)
TRIM(COL1) || TRIM(COL2) || TRIM(COL3)
PRXCHANGE('s/\s+//', -1, COL1) || PRXCHANGE('s/\s+//', -1, COL2) || PRXCHANGE('s/\s+//', -1, COL3)

What makes you think they should be equivalent?

For one thing you can get different default lengths.

From the documentation in CATX:

In a DATA step, if the CATX function returns a value to a variable that has not previously been assigned a length, then that variable is given a length of 200 bytes. If the concatenation operator (||) returns a value to a variable that has not previously been assigned a length, then that variable is given a length that is the sum of the lengths of the values that are being concatenated.

Throw Prxchange in with || and I am not going to guess what the overall default length for the result would be.

Strip or Trim(left(var)) would be much closer to cats.

Trim removes trailing blanks and returns one blank if the string is missing. So not the equivalent of using Cats or catx in either form.

data example;
   a='a ';
   b='  ';
   c=' c';
   abc = trim(a)||trim(b)||trim(c);
   hij = strip(a)||strip(b)||strip(c);
   xyz = cats(a,b,c);
   pdq = catx('',a,b,c);
run;
SAS Super FREQ
Posts: 9,431

Re: Whitespace woes

Well, if you are worried about leading blanks, then use CATS...or, just do what we used to do before the CAT family of functions -- the original way is shown below --  using TRIM(LEFT(varname)) with the || operator or just using LEFT with CATT:

no_blank2.png

 

Cynthia

Occasional Contributor
Posts: 7

Re: Whitespace woes

Posted in reply to tomcmacdonald

Any two of the three expressions are not equivalent to each other.

 

The CATX function is equivalent to strip(COL1)||" "||strip(COL2)||" "||strip(COL3), provided that a single blank character was defined to be the delimiter in the original example.

 

There are no functions equivalent to the second expression, although the closest one whould be CATT function, which is actually equivalent to trimn(COL1)||trimn(COL2)||trimn(COL3).

 

The third expression is equivalent to compress(COL1)||compress(COL2)||compress(COL3), resulting that all the space characters are removed, not just leading and trailing blanks.

Ask a Question
Discussion stats
  • 11 replies
  • 347 views
  • 4 likes
  • 6 in conversation