DougE
Calcite | Level 5

Hi,

I am trying to find a way to compress part of a string and cannot think of an easy way to do it.

I have a string with values such as "word1 x y z word2 word3" and would like the result to be "word1 xyz word2 word3".

Any ideas?

Thanks,

doug

12 REPLIES
art297
Opal | Level 21

Doug,

As always, I am sure that there are a number of ways to do what you want.  E.g.:

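data have;
  informat string $50.;
  input string &;
  cards;
word1 word2 x y z word3
this x y z that something else
;

data want (keep=string);
  set have;
  array parts(5) $10;
  i=1;   /* index of the word being scanned */
  j=0;   /* index of the output part being built */
  k=0;   /* number of single characters in the current run */
  do while (scan(string,i) ne "");
    if length(scan(string,i)) gt 1 then do;
      /* a multi-character word starts a new part and ends any run */
      j=j+1;
      parts(j)=scan(string,i);
      k=0;
    end;
    else do;
      /* a single character: open a new part on the first one, then append */
      k=k+1;
      if k eq 1 then j=j+1;
      parts(j)=catt(parts(j),scan(string,i));
    end;
    i=i+1;
  end;
  string=catx(' ',of parts(*));
run;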

data_null__
Jade | Level 19

Are there always 6 words and words 2,3,4 are compressed?

DougE
Calcite | Level 5

No, the number of words may vary and the portion of the string to compress may vary as well.

For example:

The string I'd like to compress may have 3 or more characters

x y z a b

word1 xyz a b  word2

x yz a b word1 word2 ... word6

word1 word2 x y zab

Sorry for being unclear in the initial post.

art297
Opal | Level 21

Still not clear then!  What differentiates the words that should and shouldn't be compressed?

DougE
Calcite | Level 5

I know the sequence of characters that I want to compress. I am looking at internet search data, and not everyone searches for a given company in the same way. So, if I were interested in UUNET, users may type in U U NET, UU NET, U UNET, etc. The solution I am looking for does not have to pick up instances where the characters are transposed (since the vast majority will put the characters in the correct order), just the instances where the characters are in the 'right' order but the spacing is different.

art297
Opal | Level 21

I think that you will need a prxchange solution, but will have to wait until someone posts one.

You simply want to change the sequence of a space, followed by,

u

followed by: 0 or more spaces

followed by u

followed by: 0 or more spaces

followed by n

followed by: 0 or more spaces

followed by e

followed by: 0 or more spaces

followed by t

followed by: 1 or more spaces

and change it to

space

uunet

space

Unfortunately, I don't know how to write that.

DougE
Calcite | Level 5

Thanks for the suggestion!  I found a very helpful paper at http://www2.sas.com/proceedings/sugi29/129-29.pdf that describes the prx functions and I'm sure that I can figure it out from there.

art297
Opal | Level 21

The following does, I think, what you want.  I'm sure that it could be written better:

data have;
  informat string $50.;
  input string &;
cards;
word1 word2 u u net word3
this uu net that something else
;

data want;
  set have;
  want=prxchange('s/[ ][uU][ ]{0,1}[uU][ ]{0,1}[nN][ ]{0,1}[eE][ ]{0,1}[tT][ ]/ uunet /',-1,strip(string));
run;
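For comparison, here is a minimal sketch of a looser version of the same idea, not a replacement for the code above. The pattern above allows at most one space between letters and needs a space on both sides of the match, so it cannot match at the very start or end of the string. This variant assumes that any number of spaces between the letters should be accepted and that a word boundary is enough to delimit the match. The data set name want_flexible is hypothetical; it reads the have data set created above.

```sas
data want_flexible;   /* hypothetical data set name */
  set have;
  /* \b  : match only at a word boundary, including start and end of string */
  /* " *": zero or more spaces between the letters                          */
  /* /i  : ignore case, so [uU] style character classes are not needed      */
  want=prxchange('s/\bu *u *n *e *t\b/uunet/i',-1,string);
run;
```

Because \b does not consume the neighbouring spaces, the replacement text does not need to put them back.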

Ksharp
Super User

Art.

I think it is very difficult: you do not know when to stop combining these single characters.

x y z a b   ---->      xy zab ?       xyz ab ?          xyza b  ?      xyzab ?

A workaround is to use a dictionary and look up whether the combination of these characters is a real word.

Ksharp

Peter_C
Rhodochrosite | Level 12

Ksharp points the way. (happy new year to Ksharp)

A dictionary or conversion table could list common misspellings and the correct form you prefer.  The columns would be

This $30

Should_be $30

Then you would load into an array like CORRECTIONS below

Proc SQL noprint ;
   Select nobs into :nobs_dict from dictionary.tables where libname ='DICTLIB' and memname = 'DICTIONARY';
quit;

Data corrected ;
   Array corrections(&nobs_dict,2) $30 _temporary_ ;
* load the dictionary into the temporary array once ;
   If _n_=1 then do;
      do _i_=1 to &nobs_dict ;
         Set dictlib.dictionary point=_i_ ;
         corrections(_i_,1)=this;
         corrections(_i_,2)=should_be;
      end ;
   end ;
   Set your.data ;
* now apply all dictionary entries to variable  target ;
   do _i_=1 to &nobs_dict ;
      Target = tranwrd( target, trim(Corrections(_i_,1)), trim(corrections(_i_,2)));
   end;
run;

* BEWARE this code is untested;

Message was edited by: Peter Crawford on a real computer because the iphone editor created some unhelpful extras

art297
Opal | Level 21

Peter,

Since the OP originally stated that the sequence of characters could be any combination of those characters, starting with and separated possibly by spaces, I think that a regex solution would still be advantageous.  Of course, it could always be combined with the development and utilization of a dictionary like you propose.

Shouldn't you finally break down and start learning how to capitalize on regular expressions?

Patrick
Opal | Level 21

Below is some tested code based on Ksharp's and Art's suggestions.

What it does:
1. Data step 'SearchReplacePattern' builds a list of patterns based on target words. E.g. if the target word is UUNET, the data step creates a pattern that matches the letters of this word in order, with any number of spaces between them. For UUNET the regular expression would be: \bU *U *N *E *T\b

2. Data step 'have' creates some sample data.

3. Data step 'want' compiles all patterns needed for prxchange() in its first iteration. Then, in every iteration of the data step, it loops through all compiled regular expressions and uses prxchange() to replace all matching patterns with the desired target value.

data SearchReplacePattern(keep=_SearchReplacePattern);
  length TargetWord $40 _SearchReplacePattern $160.;
  do TargetWord  = 'UUNET','ABC' ;
    _SearchReplacePattern=prxchange('s/(\S{1}\B)/$1 */i',-1,strip(TargetWord));
    _SearchReplacePattern=cats('s/\b',_SearchReplacePattern,'\b/',TargetWord,'/i');
    put _SearchReplacePattern=;
    output;
  end;
run;

data have;
  length String $ 40;
  String='word1 word2 uU net wordN';
  output;
  String='ab c word2 ab cword3 u u Net word a b c';
  output;
run;

data want(drop=_:);
  set have;
  if _n_=1 then
  do;
    do _i=1 to last;
      set SearchReplacePattern nobs=last point=_i ;
      _PatternID=prxparse(_SearchReplacePattern);
    end;
  end;
  put 'Before: ' @9 String ;
  do _PatternID=1 to last;
    String=prxchange(_PatternID,-1,String);
  end;
  put 'After: ' @9 String /;
run;
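To see what the first data step generates, the pattern for a single target word can be built and applied in isolation. This is only a sketch for checking the approach: the variable names Pattern and Result and the test string are made up for illustration, while the two pattern-building statements are the same as in the code above.

```sas
data _null_;
  length TargetWord $40 Pattern $160 Result $40;
  TargetWord='UUNET';
  /* insert " *" after every character that is followed by another word character */
  Pattern=prxchange('s/(\S{1}\B)/$1 */i',-1,strip(TargetWord));
  /* wrap it into a complete, case-insensitive substitution expression */
  Pattern=cats('s/\b',Pattern,'\b/',TargetWord,'/i');
  put Pattern=;
  /* apply the generated expression to an illustrative test string */
  Result=prxchange(strip(Pattern),-1,'word1 u u Net word2');
  put Result=;
run;
```

If the pattern is built as described in step 1, the log should show Pattern=s/\bU *U *N *E *T\b/UUNET/i and Result=word1 UUNET word2.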
