Hi everyone,
i spent a lot of time searching for a solution and tried a lot of the provided ideas, but no one worked for us. if i missed a good fitting answer, please feel free to point me into the right direction...
so, this is my first post ever, yay 🙂
our issue:
we have a character variable ("CLORRES") containing a lot of words, some separated by blanks, some by comma, e.g. "multiple, white, food contents, slight"
we need to extract some specific substrings from this variable (and these substrings contain also blanks, e.g. "food contents" and "white") which can appear on multiple positions. Then these findings/hits need be concatenated into another variable ("SUPP").
we tried:
scan (with different modifiers: q, s, t)
find (cant get it to work with a list of required subsstrings)
do loops
E.g.:
/* Transfer of modifier "Csec" ("Color secretion") */ /* (can appear on several positions) to SUPP: */ DO i=1 to 20; IF SCAN(CLORRES,i,', ') IN ('yellowish / greenish','food contents', 'dark', 'green contents', 'milky', 'pale', 'red contents', 'white contents', 'yellow contents') THEN DO; SUPP = CATX(', ',test,SCAN(CLORRES,i,', ','t')); /* transfer of modifier to SUPP */ END; END;
the results:
some of the substrings containing blanks are extracted (like "red contents", "green contents"), some are not ("food contents", "yellowish / greenish"). Also, any hits after a hit are not found by this code (e.g. "red contents, milky" will end up as "red contents", missig the "milky)
we cant get to understand the logic behind this (but it seems to have to do something with blanks being a standard delimiter for SCAN) and of course, cant figure out a solution...
please let me know if you need for info or examples, i am happy to share.
One solution i did not yet try are the prx functions as i am completely new to these. If you say this would be the way, i am happy to learn. But hoping for an easier solution...
Thanks in advance, looking forward to any hints and discussions!
Best regards from Germany
Irina
Hi @irinabe and welcome to the SAS Support Communities!
I'm sure we'll find a solution once a few points have been clarified:
So, the first question is: Are your search terms of type 1, 2 or 3?
Here's a slight modification of your code for handling type 3:
data have;
input clorres &$120.;
cards;
multiple, white, food contents, slight
red contents, milky, green contents, food contents, yellowish / greenish, dark blue
;
data want(drop=_:);
set have;
length test $120;
do _i=1 to countw(clorres,',');
_s=scan(clorres,_i,',','r');
if _s in ('yellowish / greenish', 'food contents', 'dark', 'green contents', 'milky',
'pale', 'red contents', 'white contents', 'yellow contents')
then test = catx(', ',test,_s);
end;
run;
Second, is it possible that a search term occurs multiple times in CLORRES (e.g. "dark red, dark blue") and, if so, would you like to see all those occurrences in variable TEST or just one?
Third, what about case sensitivity, e.g., should "dark" match "Dark"?
EDIT:
A more general remark: Character variables containing lists of words are awkward to work with. (For example, "red, blue" ne "red,blue" ne "blue,red".) Have you considered a "long" data structure where the comma separated terms occur as values of a shorter character variable in separate observations?
You have not supplied neither input test data nor list of sub-strings to look for and concatenate for output, nor the delimiter between the concatenated substrings .
You may try next code:
%let strings = <list of strings separated by comma >;
/* Note - if comma may be in a substring then choose ther character like # */
%let dlm = <delimiter used between substrings>;
data want;
set have;
retain search nums;
if _N_ = 1 then do;
search = %quote(&strings);
nums = countw(search,"&dlm");
end;
length new_var $100; /* adatpt to max length expected */
new_var = ' '; /* initialize for each observation */
do i=1 to nums;
if index(<varin>,scan(search,"&dlm",i)) > 0 then
new_var = catx("&dlm",new_var,scan(search,"&dlm"),i);
end;
keep new_var; /* add any other wanted variable */
run;
Hi Shmuel,
thanks for your quick reply!
Sorry, i thought the information from the code was helpful enough to get a rough understanding of the issue.
List of substring to look for: 'yellowish / greenish','food contents', 'dark', 'green contents', 'milky', 'pale', 'red contents', 'white contents', 'yellow contents'
input test data: see screenshot
Tried your code which looks very promising:
Data cl_test3; set cl_3; retain search nums; %let strings = yellowish / greenish,food contents,dark; /* Note - if comma may be in a substring then choose ther character like # */ %let dlm = ,; if _N_ = 1 then do; search = %quote(&strings); nums = countw(search,"&dlm"); end; length new_var $200; /* adatpt to max length expected */ new_var = ' '; /* initialize for each observation */ do i=1 to nums; if index(CLORRES,scan(search,"&dlm",i)) > 0 then new_var = catx("&dlm",new_var,scan(search,"&dlm"),i); end; RUN;
But the comma seems to cause trouble:
9309 Data cl_test3; 9310 set cl_3; 9311 retain search nums; 9312 9313 %let strings = yellowish / greenish,food contents,dark; 9314 /* Note - if comma may be in a substring then choose ther character like # */ 9315 %let dlm = ,; 9316 9317 9318 if _N_ = 1 then do; 9319 search = %quote(&strings); NOTE: Line generated by the macro variable "STRINGS". 1 yellowish / greenish,food contents,dark - 388 76 ERROR 388-185: Expecting an arithmetic operator. ERROR 76-322: Syntax error, statement will be ignored. 9320 nums = countw(search,"&dlm"); 9321 end;
Did I do something wrong?
BR
Irina
As I see, each input observation has at most one one required substring. Is it?
In such case there is no need to look for or concatenate input required substrings.
Check next code:
data cl_3;
input CLORRES $char80.;
cards;
discolored.redish
preset.condition.foamy.yellowish/greenish
preset.condition.mucous.white
preset.condition.foamy.yellowish/greenish
preset.condition.food contents.while
; run;
%let strings = yellowish / greenish,food contents,dark;
%let dlm = ,;
Data cl_test3;
set cl_3;
length search_for $20 search $140; /* adapt length if need */
retain search nums;
if _N_ = 1 then do;
search = "&strings"; /* assuming no double quotes in string */
nums = countw(search,"&dlm"); put nums=;
end;
length new_var $200; /* adatpt to max length expected */
new_var = ' '; /* initialize for each observation */
do i=1 to nums;
search_for = strip(scan(search,i,"&dlm")); put search_for=;
if index(CLORRES,strip(search_for)) > 0 then do;
new_var = search_for;
output;
end;
end;
keep clorres new_var;
RUN;
PAY ATTENTION -
1) In the substring list you asked for "yellowish / greenish" (with spaces around the slash)
while the input row is "preset.condition.foamy.yellowish/greenish" (no spaces).
You have to adapt the list of substrings or use compress on both sides.
2) Is your input data realy compressed, without spaces and the only delimiter between substrings is a dot "."?
3) So we are talking about two different delimiters:
(1) delimiter used in %LET statement, separating required substrings
(2) delimiter used in the input data to separate between the substrings.
Hi @irinabe and welcome to the SAS Support Communities!
I'm sure we'll find a solution once a few points have been clarified:
So, the first question is: Are your search terms of type 1, 2 or 3?
Here's a slight modification of your code for handling type 3:
data have;
input clorres &$120.;
cards;
multiple, white, food contents, slight
red contents, milky, green contents, food contents, yellowish / greenish, dark blue
;
data want(drop=_:);
set have;
length test $120;
do _i=1 to countw(clorres,',');
_s=scan(clorres,_i,',','r');
if _s in ('yellowish / greenish', 'food contents', 'dark', 'green contents', 'milky',
'pale', 'red contents', 'white contents', 'yellow contents')
then test = catx(', ',test,_s);
end;
run;
Second, is it possible that a search term occurs multiple times in CLORRES (e.g. "dark red, dark blue") and, if so, would you like to see all those occurrences in variable TEST or just one?
Third, what about case sensitivity, e.g., should "dark" match "Dark"?
EDIT:
A more general remark: Character variables containing lists of words are awkward to work with. (For example, "red, blue" ne "red,blue" ne "blue,red".) Have you considered a "long" data structure where the comma separated terms occur as values of a shorter character variable in separate observations?
Hi to the both of you!
Wow, thanks for putting so much effort in this question, i am flashed 🙂
Thank you both, i tried all your suggestions, and Reinhard's code finally solved the problem.
This line here was the game changer:
_s=scan(clorres,_i,',','r');
The data is rather like from Reinhard's point 1. Case senstitivity is not required, the values are always the same, in every occurence.
And you are sooo correct, working with these strings is awful!
But we cannot change anything about this as this is raw data from a legacy database.
Again, thank you both so much for helping me out! You had great and creative apporaches, i really learned something from you!
Take care, best regards 🙂
Irina
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Select SAS Training centers are offering in-person courses. View upcoming courses for: