BookmarkSubscribeRSS Feed
SM1
Obsidian | Level 7 SM1
Obsidian | Level 7

Hello,

 

My data set has 5 narrative text fields that accept open-ended responses. I need to figure out a way to scan through these variables and flag observations that meet specific criteria. For example, I want to create a flag if the words "NORMANDY" and "OVERBOARD" both appear in any order in at least 1 of the 5 narrative text fields.

 

Unsuccessful syntax:

 

data want;
    set have;
    array findtxt (5) narrative1 narrative2 narrative3 narrative4 narrative5;
        do i = 1 to 5;
            if prxmatch ("/(NORMANDY)(OVERBOARD)/i",findtxt(i)) then Flag = 1;
        end;
    drop i;
run;

 

Sort of successful syntax:

 

data want;
    set have;
    array findtxt (5) narrative1 narrative2 narrative3 narrative4 narrative5;
       do i = 1 to 5;
          if prxmatch ("/NORMANDY.*OVERBOARD/i",findtxt(i)) then Flag = 1; /*only flags if 'NORMANDY' comes before 'OVERBOARD'*/
        end;
    drop i;
run;

 

Since I want to flag each observation that has 'NORMANDY' and 'OVERBOARD' in any of the 5 narrative fields, including those where 'OVERBOARD' appears before 'NORMANDY', this syntax won't work.

 

Any suggestions for how to write the PRXMATCH function correctly?

 

Thanks!

 

10 REPLIES 10
novinosrin
Tourmaline | Level 20

Since you mentioned -->For example, I want to create a flag if the words "NORMANDY" and "OVERBOARD" both appear in any order in at least 1 of the 5 narrative text fields.

 

Would indexw or findw be a better choice in your case. Just my 2 cents

 

Something like if indexw(text,'NORMANDY')>0 or indexw(text,'OVERBOARD')>0 then flag=1 ;

SM1
Obsidian | Level 7 SM1
Obsidian | Level 7

@novinosrin

 

I don't think so because I'm trying to identify discrete words (i.e., 'normandy' and 'overboard') that can appear in any order instead of a string with multiple words (e.g., 'sailor fell overboard the USS Normandy').

 

But I'm happy to be wrong if I can just get the syntax to work properly!

 

 

ballardw
Super User

@SM1 wrote:

@novinosrin

 

I don't think so because I'm trying to identify discrete words (i.e., 'normandy' and 'overboard') that can appear in any order instead of a string with multiple words (e.g., 'sailor fell overboard the USS Normandy').

 

But I'm happy to be wrong if I can just get the syntax to work properly!

 

 


It may help to provide some example data. I am not sure at this point if your narrative variables are single words or not. If they are I do not see how your "sort of working" code would work at all is it is only looking at single variable at a time. You only need to supply the narrative variables and best would be a data step that would let us test code.

 

A solution if the values are single words in 5 different variables:

data want;

    informat narrative1 narrative2 narrative3 narrative4 narrative5 $15.;
    input narrative1 narrative2 narrative3 narrative4 narrative5;
    array findtxt (5) narrative1 narrative2 narrative3 narrative4 narrative5;
    if prxmatch ("/NORMANDY.*OVERBOARD|OVERBOARD.*NORMANDY/i",catx(' ',of findtxt(*))) then Flag2 = 1; /*only flags if 'NORMANDY' comes before 'OVERBOARD'*/
     
    drop i;
datalines;
NORMANDY OVERBOARD SOMETHING ELSE OTHER
OVERBOARD NORMANDY SOMETHING ELSE OTHER
NORMANDY SOMETHING ELSE OVERBOARD OTHER
NORMANDY NOTHING SOMETHING ELSE OTHER
NOTHING OVERBOARD SOMETHING ELSE OTHER
NORMANDY NORMANDY NORMANDY NORMANDY OVERBOARD 
run;

but without better examples of your data this is purely a guess.

 

SM1
Obsidian | Level 7 SM1
Obsidian | Level 7

@ballardw

 

Some sample data:

 

ID

Narrative1

Narrative2

Narrative3

Narrative4

Narrative5

16012

AIRCRAFT CALLED TO ASSIST WITH OVERBOARD SAILOR FROM USS NORMANDY

PARALLEL SEARCH PATTERN EMPLOYED TO LOCATE USS NORMANDY SAILOR REPORTED OVERBOARD

OVERBOARD SAILOR FROM USS NORMANDY NOT LOCATED

N/A

AIRCRAFT RETURNED TO BASE

 

Narrative1 and Narrative3 = OVERBOARD appears before NORMANDY

Narrative2 = NORMANDY appears before OVERBOARD

 

If I could relax the flag criteria to select if NORMANDY or OVERBOARD, then I know this syntax would work:

 

 if prxmatch ("/NORMANDY|OVERBOARD/i",findtxt(i)) then Flag = 1;

 

But I don't know how to write the syntax so that it will match only if NORMANDY and OVERBOARD are in a variable.

Tom
Super User Tom
Super User

@SM1 wrote:

@ballardw

 

Some sample data:

 

ID

Narrative1

Narrative2

Narrative3

Narrative4

Narrative5

16012

AIRCRAFT CALLED TO ASSIST WITH OVERBOARD SAILOR FROM USS NORMANDY

PARALLEL SEARCH PATTERN EMPLOYED TO LOCATE USS NORMANDY SAILOR REPORTED OVERBOARD

OVERBOARD SAILOR FROM USS NORMANDY NOT LOCATED

N/A

AIRCRAFT RETURNED TO BASE

 

Narrative1 and Narrative3 = OVERBOARD appears before NORMANDY

Narrative2 = NORMANDY appears before OVERBOARD

 

If I could relax the flag criteria to select if NORMANDY or OVERBOARD, then I know this syntax would work:

 

 if prxmatch ("/NORMANDY|OVERBOARD/i",findtxt(i)) then Flag = 1;

 

But I don't know how to write the syntax so that it will match only if NORMANDY and OVERBOARD are in a variable.


First that it NOT how to post sample data. Either post it as plain delimited text, or better post as a functional data step.

data example ;
  infile cards truncover ;
  input id (n1-n5) (/ $100.) ;
cards4;
16012
AIRCRAFT CALLED TO ASSIST WITH OVERBOARD SAILOR FROM USS NORMANDY
PARALLEL SEARCH PATTERN EMPLOYED TO LOCATE USS NORMANDY SAILOR REPORTED OVERBOARD
OVERBOARD SAILOR FROM USS NORMANDY NOT LOCATED
N/A
AIRCRAFT RETURNED TO BASE
;;;;

Second if you want to test a string for the existence of two separate words in any order then just use two separate tests.

 

Instead of  logic like this

if test_to_find_a_and_B then 

your logic will be something like this

if test_to_find_a  AND test_to_find_b then

The FINDW() function is much easier to use than regular expressions for this simple problem.  To test if N1 contains OVERBOARD you could just use something like this

findw(n1,'overboard',,'ips')

To test multiple variables just add an ARRAY and a DO loop. 

 

So your data step would look like this:

data want ;
  set example;
  array str n1-n5 ;
  found=0;
  do i=1 to dim(str) until (found);
     found = findw(str(i),'overboard',,'ips') and findw(str(i),'normandy',,'ips');
  end;
run;

 

 

 

s_lassen
Meteorite | Level 14

This is not a very good case for using a single PRXMATCH call. But of course it can be done:

data want;
    set have;
    array findtxt (5) narrative1 narrative2 narrative3 narrative4 narrative5;
       do i = 1 to 5;
          if prxmatch ("/(?:NORMANDY.*OVERBOARD|OVERBOARD.*NORMANDY)/i",findtxt(i)) then Flag = 1; 
        end;
    drop i;
run;

But I would prefer the FINDW function:

data want;
    set have;
    array findtxt (5) narrative1 narrative2 narrative3 narrative4 narrative5;
       do i = 1 to 5 until(flag=1);
          if findw(findtxt(i),'NORMANDY',,'i') and findw(findtxt(i),'OVERBOARD',,'i') then Flag = 1; 
        end;
    drop i;
run;

And then I put in an UNTIL, as there is no reason to keep looking once a match is found.

And actually, you should put in some \b markers in the PRXMATCH string, so that you only get whole words in the result; FINDW does that for you.

SM1
Obsidian | Level 7 SM1
Obsidian | Level 7

@s_lassen

 

Why is the syntax for PRXMATCH so complicated if you want "match only if 'sub-stringA' and 'sub-stringB' ", rather than "match if 'sub-stringA' or 'sub-stringB' "?

 

The syntax you provided suggests that if the flag criteria requires a match with 2 or more sub-strings, you have to actually specify the order of appearance for each sub-string. If you wanted to match only if wordA, wordB, wordC, and wordD appeared anywhere and in order in a variable, it would be really difficult to use PRXMATCH, right?

 

 

ballardw
Super User

@SM1 wrote:

@s_lassen

 

Why is the syntax for PRXMATCH so complicated if you want "match only if 'sub-stringA' and 'sub-stringB' ", rather than "match if 'sub-stringA' or 'sub-stringB' "?

 

The syntax you provided suggests that if the flag criteria requires a match with 2 or more sub-strings, you have to actually specify the order of appearance for each sub-string. If you wanted to match only if wordA, wordB, wordC, and wordD appeared anywhere and in order in a variable, it would be really difficult to use PRXMATCH, right?

 

 


Prxmatch Searches for a pattern match and returns the position at which the pattern is found.

A pattern => one pattern as the basic form. The user is left to add the functionality using OR to "a followed by b" or "b followed by a".

 

For "If you wanted to match only if wordA, wordB, wordC, and wordD appeared anywhere and in order in a variable, it would be really difficult to use PRXMATCH, right?"

 

I don't use the PRX functions enough to say for sure. There are lots of switches and options involved along with the wild card and repeat elements that I get lost with more than the basic functions. I think I have seen some references using nested calls that may not make this 4 word example as complex as you may be thinking.

SM1
Obsidian | Level 7 SM1
Obsidian | Level 7

@s_lassen

 

I ran the suggested syntax; no errors were reported.

 

The PRXMATCH syntax properly flagged the observations (n=14) that met the criteria.

 

The FINDW syntax didn't flag any of the observations (n=0), although the new variable FLAG was created.

 

It seems like there should be a way to specify more than 1 parameter to match without having to specify the universe of acceptable temporal orders for the parameters.

 

 

s_lassen
Meteorite | Level 14

@SM1 wrote:

@s_lassen

 

I ran the suggested syntax; no errors were reported.

 

The PRXMATCH syntax properly flagged the observations (n=14) that met the criteria.

 

The FINDW syntax didn't flag any of the observations (n=0), although the new variable FLAG was created.

 

It seems like there should be a way to specify more than 1 parameter to match without having to specify the universe of acceptable temporal orders for the parameters.

 

 


If you do not get any matches with the FINDW method, that is probably because FINDW looks for whole words, and for some reason your keywords are not properly delimited. When I test against the test data supplied by @ballardw, the results look OK.

If you are OK with not checking for whole words (as in the PRX example), you can try the FIND function instead. Or you can check the syntax of the FINDW function, and find out how you should specify the correct delimiters.

It is true that you have to specify the "universe of acceptable temporal orders", as you call it, in order to find the matches with a single match string. Unless you have a way of specifying "look for NORMANDY or OVERBOARD, and then look for the one of the two words that was not found the first time". Which would take search string syntax to a whole new level of complexity, I think.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 12501 views
  • 2 likes
  • 5 in conversation