Solved: Re: Extraction of multiple substrings (inclunding blanks)

irinabe · Posted 03-18-2021 06:20 AM

Hi everyone,

i spent a lot of time searching for a solution and tried a lot of the provided ideas, but no one worked for us. if i missed a good fitting answer, please feel free to point me into the right direction...

so, this is my first post ever, yay 🙂

our issue:

we have a character variable ("CLORRES") containing a lot of words, some separated by blanks, some by comma, e.g. "multiple, white, food contents, slight"

we need to extract some specific substrings from this variable (and these substrings contain also blanks, e.g. "food contents" and "white") which can appear on multiple positions. Then these findings/hits need be concatenated into another variable ("SUPP").

we tried:

scan (with different modifiers: q, s, t)

find (cant get it to work with a list of required subsstrings)

do loops

E.g.:

		/* Transfer of modifier "Csec" ("Color secretion") */
		/* (can appear on several positions) to SUPP: */
			DO i=1 to 20;
				IF SCAN(CLORRES,i,', ') IN ('yellowish / greenish','food contents', 'dark', 'green contents', 'milky', 'pale', 'red contents', 'white contents', 'yellow contents') THEN DO;
					SUPP = CATX(', ',test,SCAN(CLORRES,i,', ','t')); /* transfer of modifier to SUPP */
				END;	
			END;

the results:

some of the substrings containing blanks are extracted (like "red contents", "green contents"), some are not ("food contents", "yellowish / greenish"). Also, any hits after a hit are not found by this code (e.g. "red contents, milky" will end up as "red contents", missig the "milky)

we cant get to understand the logic behind this (but it seems to have to do something with blanks being a standard delimiter for SCAN) and of course, cant figure out a solution...

please let me know if you need for info or examples, i am happy to share.

One solution i did not yet try are the prx functions as i am completely new to these. If you say this would be the way, i am happy to learn. But hoping for an easier solution...

Thanks in advance, looking forward to any hints and discussions!

Best regards from Germany

Irina

FreelanceReinh · Posted 03-18-2021 08:00 AM

Hi @irinabe and welcome to the SAS Support Communities!

I'm sure we'll find a solution once a few points have been clarified:

The term "substring" is more general than "word." For example, "green" is a substring of "yellowish / greenish" and could be found therein by means of the FIND function, but not really with SCAN.
A slightly different task would be to find the word "dark" in the string "light green, dark blue" (but not in "darker") -- the FINDW function would be suitable for this.
Similarly, if the comma is defined as the word delimiter, then "food contents" can be found as a word in "white,food contents,slight" and with suitable code also in "white, food contents, slight".

So, the first question is: Are your search terms of type 1, 2 or 3?

Here's a slight modification of your code for handling type 3:

data have;
input clorres &$120.;
cards;
multiple, white, food contents, slight
red contents, milky, green contents, food contents, yellowish / greenish, dark blue
;

data want(drop=_:);
set have;
length test $120;
do _i=1 to countw(clorres,',');
  _s=scan(clorres,_i,',','r');
  if _s in ('yellowish / greenish', 'food contents', 'dark', 'green contents', 'milky',
            'pale', 'red contents', 'white contents', 'yellow contents')
  then test = catx(', ',test,_s);
end;
run;

Second, is it possible that a search term occurs multiple times in CLORRES (e.g. "dark red, dark blue") and, if so, would you like to see all those occurrences in variable TEST or just one?

Third, what about case sensitivity, e.g., should "dark" match "Dark"?

EDIT:

A more general remark: Character variables containing lists of words are awkward to work with. (For example, "red, blue" ne "red,blue" ne "blue,red".) Have you considered a "long" data structure where the comma separated terms occur as values of a shorter character variable in separate observations?

View solution in original post

Shmuel · Posted 03-18-2021 07:33 AM

You have not supplied neither input test data nor list of sub-strings to look for and concatenate for output, nor the delimiter between the concatenated substrings .

You may try next code:

%let strings = <list of strings separated by comma >;
/* Note - if comma may be in a substring then choose ther character like # */
%let dlm = <delimiter used between substrings>;

data want;
 set have;
 	 retain search nums;
     if _N_ = 1 then do;
        search = %quote(&strings);
		nums = countw(search,"&dlm");
	 end;
	 length new_var $100; /* adatpt to max length expected */
	 new_var = ' '; /* initialize for each observation */
	 do i=1 to nums;
	    if index(<varin>,scan(search,"&dlm",i)) > 0 then  
		   new_var = catx("&dlm",new_var,scan(search,"&dlm"),i);
	 end;
     keep new_var;  /* add any other wanted variable */
run;

irinabe · Posted 03-18-2021 08:55 AM

Hi Shmuel,

thanks for your quick reply!

Sorry, i thought the information from the code was helpful enough to get a rough understanding of the issue.

List of substring to look for: 'yellowish / greenish','food contents', 'dark', 'green contents', 'milky', 'pale', 'red contents', 'white contents', 'yellow contents'

input test data: see screenshot

Tried your code which looks very promising:

Data cl_test3;
set cl_3;
retain search nums;

%let strings = yellowish / greenish,food contents,dark;
/* Note - if comma may be in a substring then choose ther character like # */
%let dlm = ,;


 if _N_ = 1 then do;
    search = %quote(&strings);
	nums = countw(search,"&dlm");
 end;
 length new_var $200; /* adatpt to max length expected */
 new_var = ' '; /* initialize for each observation */
 do i=1 to nums;
    if index(CLORRES,scan(search,"&dlm",i)) > 0 then  
	   new_var = catx("&dlm",new_var,scan(search,"&dlm"),i);
 end;

RUN;

But the comma seems to cause trouble:

9309   Data cl_test3;
9310   set cl_3;
9311   retain search nums;
9312
9313   %let strings = yellowish / greenish,food contents,dark;
9314   /* Note - if comma may be in a substring then choose ther character like # */
9315   %let dlm = ,;
9316
9317
9318    if _N_ = 1 then do;
9319       search = %quote(&strings);
NOTE: Line generated by the macro variable "STRINGS".
1      yellowish / greenish,food contents,dark
                           -
                           388
                           76
ERROR 388-185: Expecting an arithmetic operator.

ERROR 76-322: Syntax error, statement will be ignored.

9320       nums = countw(search,"&dlm");
9321    end;

Did I do something wrong?

BR

Irina

Shmuel · Posted 03-18-2021 11:51 AM

As I see, each input observation has at most one one required substring. Is it?

In such case there is no need to look for or concatenate input required substrings.

Check next code:

data cl_3;
  input CLORRES $char80.;
cards;
discolored.redish
preset.condition.foamy.yellowish/greenish
preset.condition.mucous.white
preset.condition.foamy.yellowish/greenish
preset.condition.food contents.while
; run;

%let strings = yellowish / greenish,food contents,dark;
%let dlm = ,;

Data cl_test3;
 set cl_3;
     length search_for $20 search $140; /* adapt length if need */ 
	 retain search nums;
	 if _N_ = 1 then do;
	    search = "&strings"; /* assuming no double quotes in string */
        nums = countw(search,"&dlm"); put nums=;
	 end;
	 length new_var $200; /* adatpt to max length expected */
	 new_var = ' '; /* initialize for each observation */
	 do i=1 to nums;
	    search_for = strip(scan(search,i,"&dlm")); put search_for=; 
	    if index(CLORRES,strip(search_for)) > 0 then do; 
		   new_var = search_for;
		   output;
		end;
	 end;
     keep clorres new_var;
RUN;

PAY ATTENTION -

1) In the substring list you asked for "yellowish / greenish" (with spaces around the slash)

while the input row is "preset.condition.foamy.yellowish/greenish" (no spaces).

You have to adapt the list of substrings or use compress on both sides.

2) Is your input data realy compressed, without spaces and the only delimiter between substrings is a dot "."?

3) So we are talking about two different delimiters:

(1) delimiter used in %LET statement, separating required substrings

(2) delimiter used in the input data to separate between the substrings.

FreelanceReinh · Posted 03-18-2021 08:00 AM

Hi @irinabe and welcome to the SAS Support Communities!

I'm sure we'll find a solution once a few points have been clarified:

The term "substring" is more general than "word." For example, "green" is a substring of "yellowish / greenish" and could be found therein by means of the FIND function, but not really with SCAN.
A slightly different task would be to find the word "dark" in the string "light green, dark blue" (but not in "darker") -- the FINDW function would be suitable for this.
Similarly, if the comma is defined as the word delimiter, then "food contents" can be found as a word in "white,food contents,slight" and with suitable code also in "white, food contents, slight".

So, the first question is: Are your search terms of type 1, 2 or 3?

Here's a slight modification of your code for handling type 3:

data have;
input clorres &$120.;
cards;
multiple, white, food contents, slight
red contents, milky, green contents, food contents, yellowish / greenish, dark blue
;

data want(drop=_:);
set have;
length test $120;
do _i=1 to countw(clorres,',');
  _s=scan(clorres,_i,',','r');
  if _s in ('yellowish / greenish', 'food contents', 'dark', 'green contents', 'milky',
            'pale', 'red contents', 'white contents', 'yellow contents')
  then test = catx(', ',test,_s);
end;
run;

Second, is it possible that a search term occurs multiple times in CLORRES (e.g. "dark red, dark blue") and, if so, would you like to see all those occurrences in variable TEST or just one?

Third, what about case sensitivity, e.g., should "dark" match "Dark"?

EDIT:

A more general remark: Character variables containing lists of words are awkward to work with. (For example, "red, blue" ne "red,blue" ne "blue,red".) Have you considered a "long" data structure where the comma separated terms occur as values of a shorter character variable in separate observations?

irinabe · Posted 03-19-2021 03:59 AM

Hi to the both of you!

Wow, thanks for putting so much effort in this question, i am flashed 🙂

Thank you both, i tried all your suggestions, and Reinhard's code finally solved the problem.

This line here was the game changer:

_s=scan(clorres,_i,',','r');

The data is rather like from Reinhard's point 1. Case senstitivity is not required, the values are always the same, in every occurence.

And you are sooo correct, working with these strings is awful!

But we cannot change anything about this as this is raw data from a legacy database.

Again, thank you both so much for helping me out! You had great and creative apporaches, i really learned something from you!

Take care, best regards 🙂

Irina

Extraction of multiple substrings (inclunding blanks)

Re: Extraction of multiple substrings (inclunding blanks)

Re: Extraction of multiple substrings (inclunding blanks)

Re: Extraction of multiple substrings (inclunding blanks)

Re: Extraction of multiple substrings (inclunding blanks)

Re: Extraction of multiple substrings (inclunding blanks)

Re: Extraction of multiple substrings (inclunding blanks)

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away