BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MR_E
Obsidian | Level 7

 

Hello All,

 

As a newbie to creating macro's i was hoping for a tad of help if you please:

 

i want to identify certain text that meets conditions, one being that i have an extensive list of values in a dataset that are "keywords" as such.

 

Rather than using %let Colour = 'BLUE' 'RED' etc , i was hoping to compile a macro that would look at each value individually (from a list of 1000plus entries on a dataset) against a seperate piece of text data.

 

here is an example below using colours, where i want to match using regular expressions (ideal for my actual process) to look for the colour plus car , ball or yacht.

 

The below example is wrong as i know it gives me a varlist value of all of the colours in row, where im hoping a macro could simply give me a reference for all values (it looks at each value individually). I'm probably leaning towards a %do loop method but my attempts thus far are miserable.

 

example below:

 

 

DATA Colour_text;

input text $CHAR38.;

datalines;

i want a BLUE BALL

did you see a blue car

i have owned a yellow yacht

id never buy a red ball

we want a turqoise car

she said it was a yellow plane

he drove in his green Porsche

i didnt like his black heart

he had a grey ball

i saw a white plane

;

run;

/*could be a list of 1000 plus entries*/

DATA Colours;

input NAME $CHAR10.;

datalines;

BLUE

YELLOW

GREEN

RED

BLACK

WHITE

;

run;

/*macro required*/

 

data _null_;

length allcolours $1000;

retain allcolours ' ';

set Colours end=eof;

allcolours = trim(left(allcolours))||' '||left(NAME);

if eof then call symput('varlist', allcolours);

run;

%put &varlist;

/*identify only colours from the colour list with the trailing words car/ball/yacht */

data WHAT_I_want;

set colour_text;

if prxmatch("/[a-z]*&varlist;([a-z]| car| ball| yacht)/i",text);

run;

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

To make this handle large cases remove the macro variable and add DO loops.

 

Something like this.  Read more about how to use REGEX functions in SAS and you could possible optimize it more.

 

data want;
  array regex (4000) $200 _temporary_ ;
* load colours names into list;
  if _n_=1 then do p=1 to nobs ;
     set colours point=p nobs=nobs;
     * Build REGEX for this "NAME" ;
     regex(p)=cats("/[a-z]*(",NAME,")([a-z]| car| ball| yacht)/i");
  end;
* keep matches;
   set Colour_text;
   found=0;
   do i=1 to nobs until(found);
      found=prxmatch(regex(i),text);
   end;
   if found;
run;
 

View solution in original post

18 REPLIES 18
DanielSantos
Barite | Level 11

Hi.

 

Give a look at my last solution proposal on this topic:

 

https://communities.sas.com/t5/Base-SAS-Programming/Hide-the-name-inside-the-text/m-p/321836#U321836

 

That's exactly what you need, just remove the masking code.

 

Hope it helps.

 

Daniel Santos @ www.cgd.pt

MR_E
Obsidian | Level 7

Hi DanielSantos,

 

thank you for the prompt response - i applied the logic you used previously and adjusted to my query ,removing the masks, but receive nil results - are you able to see why?

 

i also tried the IF clause you had, but that brought back all results, (as it would without including a clause for yacht car etc as in my reg exp)

 

data _null_;

set Colours;

call symput('NAMES',catx(' ',symget('NAMES'),upcase(NAME)));

run;

%put NAMES=&NAMES;

data want;

length _WORD $200 _SEP $1;

set Colour_text;

drop _:; * drop aux vars;

_text=text; _text=''; _SEP=''; * init aux vars;

* process one word;

do _I=1 to countw(text);

_text=catx(_SEP,_text,_WORD);

_WORD=scan(text,_I,' ');

/* * if match, mask word; */

/**/

/* if index(" &NAMES ",cat(' ',strip(upcase(_WORD)),' ')) then do;*/

if prxmatch("/[a-z]*&NAMES;([a-z]| car| ball| yacht)/i",text) then OUTPUT;

/* _WORD=repeat('X',lengthn(_WORD)-1);*/

/* _SEP='X'; */

/* end;*/

/* else _SEP=' '; * clear masking separator;*/

end;

text=catx(_SEP,_text,_WORD);

run;

RW9
Diamond | Level 26 RW9
Diamond | Level 26

This bit of code adds found and found2 to color_text and saves to dataset want:

data _null_;
  set colours end=last;
  if _n_=1 then call execute('data want; set colour_text;');
  call execute(cats('if index(upcase(text),"',name,'") > 0 then found=1;'));
  if last then call execute('if scan(reverse(text),1," ") in ("rac","llab","thcay") then found2=1;run;');
run;

What this does is, by using the second dataset, generate the code necessary to parse the first and add in the flags.  If you un it and look at the log you will see the generated code looks like:

NOTE: CALL EXECUTE generated line.
1     + data want; set colour_text;
2     + if index(upcase(text),"BLUE") > 0 then found=1;
3     + if index(upcase(text),"YELLOW") > 0 then found=1;
4     + if index(upcase(text),"GREEN") > 0 then found=1;
5     + if index(upcase(text),"RED") > 0 then found=1;
6     + if index(upcase(text),"BLACK") > 0 then found=1;
7     + if index(upcase(text),"WHITE") > 0 then found=1;
8     + if scan(reverse(text),1," ") in ("rac","llab","thcay") then found2=1;run;

NOTE: There were 10 observations read from the data set WORK.COLOUR_TEXT.

Do note however, it is a more advanced topic, not sure how good your knowledge of Base SAS is - remember macro and code generation are only there to help, they are not a replacement for Base SAS.  For instance your problem could also be solved by merging colours onto colour_text, based on finding the string within the list or transposing the colors:

proc transpose data=colours out=t_colours;
  var name;
run;

proc sql;
  create table WANT as
  select  A.*,
          B.*
  from    COLOUR_TEXT A
  left join T_COLOURS B
  on      1=1;
quit;

data want;
  set want;
  array col col:;
  do i=1 to dim(col);
    if found=0 then found=findw(upcase(text),strip(col{i}));
  end;
run;

There are various methods.

MR_E
Obsidian | Level 7

Hi RW9

 

whilst that is a interesting and useful approach, it still does not give me the output im after ,i.e it does not show those with just the values i require. ideally i would like to keep this within a macro also, but do appreciate the alternative approach

 

text _NAME_ COL1 COL2 COL3 COL4 COL5 COL6                                            i   found
i want a BLUE BALL NAME BLUE YELLOW GREEN RED BLACK WHITE          7       .
did you see a blue car NAME BLUE YELLOW GREEN RED BLACK WHITE       7        .
i have owned a yellow yacht NAME BLUE YELLOW GREEN RED BLACK WHITE 7    .
id never buy a red ball NAME BLUE YELLOW GREEN RED BLACK WHITE       7        .
we want a turqoise car NAME BLUE YELLOW GREEN RED BLACK WHITE       7        .
she said it was a yellow plane NAME BLUE YELLOW GREEN RED BLACK WHITE 7 .
he drove in his green Porsche NAME BLUE YELLOW GREEN RED BLACK WHITE 7 .
i didnt like his black heart NAME BLUE YELLOW GREEN RED BLACK WHITE 7 .
he had a grey ball NAME BLUE YELLOW GREEN RED BLACK WHITE 7 .
i saw a white plane NAME BLUE YELLOW GREEN RED BLACK WHITE 7 .

Tom
Super User Tom
Super User

Not sure how the example above doesn't give the results you want?  Perhaps you mean you would rather the output dataset just had those observations where the calculated FOUND flag was not zero?   If so just add an appropriate condition output (or delete) statement.

 

Macro code is only useful for generating code. It is not too useful for manipulating data. Your problem is really about comparing two sets of DATA so the real problem is to find SAS code that solves the problem. Then you can see if macro logic can help you generate that code.

 

But be aware of the limits of SAS strings. Variables (like the regular expression in your original post) can only be 32K long. So a long list of text values to find might exceed that limit.  Similarly macro variables are limited to 64K.  So a method like above that uses multiple variables to store the strings or one that uses multiple observations is probably the one that will give you general solution that will work for future sets of keywords and strings.

MR_E
Obsidian | Level 7

Hi Tom,

 

the second version of the code ( starting with proc transpose ) did not identify just those results with car / yacht etc , it brought back everything with found = .

 

the first version did however, i have just got back with an alternative scenario also to refrain from using "last" as the second keyword may not be the last word.

 

 

Tom
Super User Tom
Super User

If you mathcing logic is more complex then you probably want to change from using a simple FIND() function call to using more complex test such as the REGEX.  You can build the REGEX from VARIABLES instead of MACRO VARIABLES.  Although if your datasets are large you might want to looking into ways to get each distinct generated REGEX expression parsed only once for the data step rather than re-compiling them for every observation of that data you are checking.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Just a couple of points, the data _null_ approach is not really any different to macro, it is merely generating the text which gets included as code later on.  It can be changed in anyway you want, just change:

  if last then call execute('if scan(reverse(text),1," ") in ("rac","llab","thcay") then found2=1;run;');

To something appropriate to your requirements:

  if last then call execute('if scan(reverse(text),1," ") in ("rac","llab","thcay") then found2=1; if found=1 and found1=1 then output;run;');

 Don't get into the mindset that macro is something special and should be used for everything.  It is an additional tool, and does nothing in of itself, only generates strings.  Almost all the time I see macro code, it is messy, obfuscated and unnecessary.  A good grasp of Base SAS is far more important as almost everything can be done there, either by processing or by data modelling.

MR_E
Obsidian | Level 7

Apologies RW9 - i see what you are doing there and it is genius.

 

I wondered why the first value was not included, but i need to add in a case insensitive clause to your query which will solve that.

 

just to throw a spanner in the works (i should have been more precise earlier), how do we work around this when out second keyword ( car yacht ball)

is not the last word, as i notice the query looks at last values for this second keyword, such as if we adjsuted the initial dataset to:

 

DATA Colour_text;

input text $CHAR38.;

datalines;

i want a BLUE BALL please

did you see a blue car go by

i have owned a yellow yacht before

id never buy a red ball

we want a turqoise car

she said it was a yellow plane toy

he drove in his green Porsche

i didnt like his black heart

he had a grey ball

i saw a white plane

;

run;

 

I also appreciate the note on the macro relevance, and the importance of using the alternative method, as i also presume my actual list of variables would be close to exceeding those character limits, which makes your approach all the more attractive

 

thanks once again

DanielSantos
Barite | Level 11

Hi MR_E.

 

It's actually a bit simpler for your case, here's the code modified to keep rows which will match the parse.

 

* load colours names into list;

data _null_;
     set colours;
     call symput('NAMES',catx('|',symget('NAMES'),upcase(NAME)));
run;


%put NAMES=&NAMES;

 

* keep matches;

data want;
     set Colour_text;
     if prxmatch("/[a-z]*(&NAMES;)([a-z]| car| ball| yacht)/i",text);
run;

 

Hope it helps.

 

Daniel Santos @ www.cgd.pt

MR_E
Obsidian | Level 7

wow - i removed the smiley face 😉 and it produced the exact results i wanted - thankyou

 

however, and whilst i hate to caveat this, i feared this all along, there is a limit on the characters in this approach and it becomes unstable , whilst it still provided the correct results:

 

SYMBOLGEN: Macro variable NAMES resolves to BLUE YELLOW GREEN RED BLACK WHITE BLUE YELLOW GREEN RED BLACK WHITE BLUE YELLOW GREEN

RED BLACK WHITE BLUE YELLOW GREEN RED BLACK

WHITE|BLUE|YELLOW|GREEN|RED|BLACK|WHITE|BLUE|YELLOW|GREEN|RED|BLACK|WHITE|BLUE|YELLOW|GREEN|RED|BLACK|WHITE|BLUE|YELLOW|

GREEN|RED|BLACK|WHITE

WARNING: The quoted string currently being processed has become more than 262 bytes long. You might have unbalanced quotation

marks.

 

This concerns me as my actual list is at least 4000 variables long and will likely grow going forwards, which makes the above issue even more worrying.

 

Whilst PRXMATCH as you have done is my preffered method, i think the Call reference routine may not be viable this time (please tell me im wrong).

 

Thank you regardless of the outcome

 

 

Tom
Super User Tom
Super User

To make this handle large cases remove the macro variable and add DO loops.

 

Something like this.  Read more about how to use REGEX functions in SAS and you could possible optimize it more.

 

data want;
  array regex (4000) $200 _temporary_ ;
* load colours names into list;
  if _n_=1 then do p=1 to nobs ;
     set colours point=p nobs=nobs;
     * Build REGEX for this "NAME" ;
     regex(p)=cats("/[a-z]*(",NAME,")([a-z]| car| ball| yacht)/i");
  end;
* keep matches;
   set Colour_text;
   found=0;
   do i=1 to nobs until(found);
      found=prxmatch(regex(i),text);
   end;
   if found;
run;
 
MR_E
Obsidian | Level 7

@Tom and @DanielSantos thankyou both for this solution, this has helped me modify my exact query to exactly how i wanted, being able to also include a second array fairly easily which was also require (additional requirements)

 

Thank you all for your assistance, its great to see Fantastic minds at work

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, the simplest way, as you have two sets of values - again, there are various different ways - can be found below.

Edit, included a plain Base SAS example further below which shows data modelling change, merging and arrays to achieve the same result.

 

data colour_text;
  input text $char38.;
datalines;
i want a BLUE BALL please
did you see a blue car go by
i have owned a yellow yacht before
id never buy a red ball
we want a turqoise car
she said it was a yellow plane toy
he drove in his green Porsche
i didnt like his black heart
he had a grey ball
i saw a white plane
;
run;

data colours;
  input name $char10.;
datalines;
BLUE
YELLOW
GREEN
RED
BLACK
WHITE
;
run;

data types;
  input type $char15.;
datalines;
CAR
BALL
YACHT
;
run;

data _null_;
  set colours end=last;
  if _n_=1 then call execute('data want; set colour_text;');
  call execute(cats('if found < 1 then found=index(upcase(text),"',name,'");'));
  if last then call execute('run;');
run;

data _null_;
  set types end=last;
  if _n_=1 then call execute('data want; set want;');
  call execute(cats('if found2 < 1 then found2=index(upcase(text),"',type,'");'));
  if last then call execute('run;');
run;

data want (drop=found found2);
  set want (where=(found > 0 and found2 > 0));
run;

 

/* Note same test data as above */
data types;
  input type $char15.;
datalines;
CAR
BALL
YACHT
;
run;

proc transpose data=colours out=t_colors prefix=c;
  var name;
run;
proc transpose data=types out=t_types prefix=t;
  var type;
run;
data total;
  merge t_colors t_types;
run;

proc sql;
  create table WANT as
  select  A.*,
          B.*
  from    COLOUR_TEXT A
  left join TOTAL B
  on      1=1;
quit;

data want (keep=text);
  set want;
  array c c:;
  array t t:;
  do i=1 to dim(c);
    if findw(upcase(text),strip(c{i})) then found=sum(found,1);
  end;
  do i=1 to dim(t);
    if findw(upcase(text),strip(t{i})) then found2=sum(found2,1);
  end;
  if found > 0 and found2 > 0 then output;
run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 18 replies
  • 3106 views
  • 8 likes
  • 4 in conversation