Compress part of a string

New Contributor
Posts: 4

Hi,

I am trying to find a way to compress part of a string and cannot think of an easy way to do it.

I have a string with a value like "word1 x y z word2 word3" and would like the result to be "word1 xyz word2 word3".

Any ideas?

Thanks,

doug

PROC Star
Posts: 7,467

Doug,

As always, I am sure that there are a number of ways to do what you want.  E.g.:

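data have;
  informat string $50.;
  input string &;
  cards;
word1 word2 x y z word3
this x y z that something else
;

data want (keep=string);
  set have;
  array parts(5) $10.;
  i=1;  /* index of the word currently being scanned */
  j=0;  /* index of the output part being built */
  k=0;  /* number of single-character words seen so far */
  do while (scan(string,i) ne "");
    if length(scan(string,i)) gt 1 then do;
      /* a normal word starts a new part */
      j=j+1;
      parts(j)=scan(string,i);
    end;
    else do;
      /* a single character: the first one starts a new part, later ones are appended to it */
      k=k+1;
      if k eq 1 then j=j+1;
      parts(j)=catt(parts(j),scan(string,i));
    end;
    i=i+1;
  end;
  string=catx(' ',of parts(*));
run;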

Respected Advisor
Posts: 3,799

Are there always 6 words and words 2,3,4 are compressed?

New Contributor
Posts: 4

Posted in reply to data_null__

No, the number of words may vary, and the portion of the string to compress may vary as well.

For example:

The string I'd like to compress may have 3 or more characters

x y z a b

word1 xyz a b  word2

x yz a b word1 word2 ... word6

word1 word2 x y zab

Sorry for being unclear in the initial post.

PROC Star
Posts: 7,467

Still not clear then!  What differentiates the words that should and shouldn't be compressed?

New Contributor
Posts: 4

I know the sequence of characters that I want to compress. I am looking at internet search data, and not everyone searches for a given company in the same way. So, if I were interested in UUNET, users may type in U U NET, UU NET, U UNET, etc. The solution I am looking for does not have to pick up the instances where the characters are transposed (since the vast majority will put the characters in the correct order), just the instances where the characters are in the 'right' order but the spacing is different.

PROC Star
Posts: 7,467

I think that you will need a prxchange solution, but will have to wait until someone posts one.

You simply want to change the following sequence:

a space
followed by u
followed by 0 or more spaces
followed by u
followed by 0 or more spaces
followed by n
followed by 0 or more spaces
followed by e
followed by 0 or more spaces
followed by t
followed by 1 or more spaces

to:

a space
uunet
a space

Unfortunately, I don't know how to write that.
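
For reference, the sequence spelled out above might translate into a prxchange call along the following lines. This is only a sketch: the output dataset SKETCH and the variable FIXED are made-up names, and the two sample rows are purely illustrative. As described, the pattern needs a space before the first u, so a match at the very start of the string would be missed.

```sas
/* Illustrative sample data; the rows are assumptions for this sketch */
data have;
  length string $50;
  string = 'word1 word2 u u net word3';       output;
  string = 'this uu net that something else'; output;
run;

/* SKETCH and FIXED are hypothetical names */
data sketch;
  set have;
  length fixed $50;
  /* a space, then u u n e t each optionally followed by spaces,   */
  /* then one or more spaces; the trailing i makes it ignore case  */
  fixed = prxchange('s/ u *u *n *e *t +/ uunet /i', -1, string);
run;
```

The -1 asks prxchange to replace every occurrence rather than only the first.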

New Contributor
Posts: 4

Thanks for the suggestion!  I found a very helpful paper at http://www2.sas.com/proceedings/sugi29/129-29.pdf that describes the prx functions and I'm sure that I can figure it out from there.

PROC Star
Posts: 7,467

The following does, I think, what you want.  I'm sure that it could be written better:

data have;
  informat string $50.;
  input string &;
cards;
word1 word2 u u net word3
this uu net that something else
;

data want;
  set have;
  /* u,u,n,e,t in either case, at most one space between letters, with a space on each side */
  want=prxchange('s/[ ][uU][ ]{0,1}[uU][ ]{0,1}[nN][ ]{0,1}[eE][ ]{0,1}[tT][ ]/ uunet /',-1,strip(string));
run;

Super User
Posts: 10,018

Art,

I think it is very difficult: you do not know when to stop combining these single characters.

x y z a b   ---->      xy zab ?       xyz ab ?          xyza b  ?      xyzab ?

A workaround is to use a dictionary and look up whether the combination of these characters is a real word.

Ksharp

Valued Guide
Posts: 2,177

Ksharp points the way. (happy new year to Ksharp)

A dictionary or conversion table could list common misspellings and the correct form you prefer.  The columns would be

This $30

Should_be $30
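
A conversion table with those two columns might be set up as in the sketch below. It assumes the libref DICTLIB has already been assigned, and the three entries are only examples taken from the spacing variants mentioned earlier in the thread. Keep in mind that TRANWRD is case-sensitive, so real entries would have to cover the cases that actually occur in the data (or the target could be upcased first).

```sas
/* Sketch only: assumes the libref DICTLIB has already been assigned */
data dictlib.dictionary;
  length this should_be $30;
  infile datalines dsd;   /* comma-delimited, so embedded blanks are kept */
  input this should_be;
  datalines;
U U NET,UUNET
UU NET,UUNET
U UNET,UUNET
;
run;
```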

Then you would load into an array like CORRECTIONS below

Proc SQL noprint ;
   Select nobs into :nobs_dict from dictionary.tables where libname ='DICTLIB' and memname = 'DICTIONARY';
quit;

Data corrected ;
   Array corrections(&nobs_dict,2) $30 _temporary_ ;
   If _n_=1 then do;
      * load the whole dictionary into the temporary array once;
      do _i_=1 to &nobs_dict ;
         Set dictlib.dictionary point=_i_ ;
         corrections(_i_,1)=this;
         corrections(_i_,2)=should_be;
      end ;
   end ;
   Set your.data ;
   * now apply all dictionary entries to variable  target ;
   do _i_=1 to &nobs_dict ;
      Target = tranwrd( target, trim(Corrections(_i_,1)), trim(corrections(_i_,2)));
   end;
run;

* BEWARE this code is untested;

Message was edited by: Peter Crawford on a real computer because the iphone editor created some unhelpful extras

PROC Star
Posts: 7,467

Peter,

Since the OP originally stated that the sequence of characters could be any combination of those characters, starting with and separated possibly by spaces, I think that a regex solution would still be advantageous.  Of course, it could always be combined with the development and utilization of a dictionary like you propose.

Shouldn't you finally break down and start learning how to capitalize on regular expressions?

Respected Advisor
Posts: 4,173

Below is some tested code based on Ksharp's and Art's suggestions.

What it does:
1. Data step 'SearchReplacePattern' builds a list of patterns based on target words. E.g., if the target word is UUNET, the data step creates a pattern that matches any spacing of this string. For UUNET the regular expression would be: \bU *U *N *E *T\b

2. Data step 'have' creates some sample data.

3. Data step 'want' compiles all patterns in its first iteration, as needed for prxchange(). Then, for every observation, it loops through all compiled regular expressions, using prxchange() to replace every match with the desired target value.

data SearchReplacePattern(keep=_SearchReplacePattern);
  length TargetWord $40 _SearchReplacePattern $160.;
  do TargetWord  = 'UUNET','ABC' ;
    /* insert ' *' after every character that is not the last one of the word */
    _SearchReplacePattern=prxchange('s/(\S{1}\B)/$1 */i',-1,strip(TargetWord));
    /* wrap it into a complete substitution: s/\b<pattern>\b/<TargetWord>/i */
    _SearchReplacePattern=cats('s/\b',_SearchReplacePattern,'\b/',TargetWord,'/i');
    put _SearchReplacePattern=;
    output;
  end;
run;

data have;
  length String $ 40;
  String='word1 word2 uU net wordN';
  output;
  String='ab c word2 ab cword3 u u Net word a b c';
  output;
run;

data want(drop=_:);
  set have;
  /* compile every pattern once; prxparse returns the IDs 1..last */
  if _n_=1 then
  do;
    do _i=1 to last;
      set SearchReplacePattern nobs=last point=_i ;
      _PatternID=prxparse(_SearchReplacePattern);
    end;
  end;
  put 'Before: ' @9 String ;
  /* apply each compiled pattern to the current string */
  do _PatternID=1 to last;
    String=prxchange(_PatternID,-1,String);
  end;
  put 'After: ' @9 String /;
run;
