BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
martyvd
Fluorite | Level 6

I am trying to flag certain words in 3 different string variables using a macro variable list. For example, I created a macro variable containing a list of US state names. 

 

proc sql noprint;                                                                                                                       
  select staten
  into :statenames separated by '|'                                                                                                     
  from sashelp.tgrmaps;                                                                                                                            
quit;

First I tried to use the findw function but it did not flag any records. The code works with a single word as the argument so I suspect it may be searching for the entire list rather than the individual names. I wasn't able to work out how to fix that. 

 

if %sysfunc(findw(var1,&statenames, ,i))>=1 then state_flag=1;
else if %sysfunc(findw(var2,&statenames, ,i))>=1 then state_flag=1;
else if %sysfunc(findw(var3,&statenames, ,i))>=1 then state_flag=1;
else state_flag=0;

 

I then tried to use pattern matching as shown below. However my dataset is very large (~7 million) records and it seems this method is too slow and does not run even after 2+ hours. 

 

if prxmatch("/&statenames/i",var1)>0 then state_flag=1;
else if prxmatch("/&statenames/i",var2)>0 then state_flag=1;
else if prxmatch("/&statenames/i",var3)>0 then state_flag=1;
else state_flag=0; 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

@martyvd wrote:

The whole data step is as simple as you wrote it which is why it was not included. Here are some examples of what var1, var2, and var3 might look like. 

 

data test;
var1='I live in Alabama';
var2='My blood pressure is high';
var3='I went to my doctor today';
output;

data want;
set test;

if findw("&statenames",var1,'|','i')>=1 then state_flag=1;
else if findw("&statenames",var2,'|','i')>=1 then state_flag=1;
else if findw("&statenames",var3,'|','i')>=1 then state_flag=1;
else state_flag=0;
run;

The code should flag this record because it contains 'Alabama' but it does not as currently written. 


No it should not find "Alabama" as you are searching for the value 'I live in Alabama'.

If you want to find Alabama then you have to parse out each word in the value of Var1 and search for that individually.

Easy enough to demonstrate:

data test;
var1='I live in Alabama';
var2='My blood pressure is high';
var3='I went to my doctor today';
output;

data want;
set test;

if findw("Alabama Alaska",var1,'|','i')>=1 then state_flag=1;
run;

So try something more like:

data test;
var1='I live in Alabama';
var2='My blood pressure is high';
var3='I went to my doctor today';
output;

data want;
set test;
length word $ 25;
do i=1 to countw(var1);
   word=scan(var1,i);
   if findw("Alabama|Alaska",strip(word),'|','i')>=1 then do;
      state_flag=1;
      leave;
   end;
end;
run;

Need to make sure that the temporary variable Word is defined to be long enough to contain likely "words" in your phrases. You will need Strip in the Findw because otherwise the variable Word will be padded to its length with blanks and likely not found in the phrase.

The Leave instruction will terminate the loop over the words in Var1 as soon as the first match is found.

You would want to add two more loops for var2 and var3 OR concatenate Var1, var2 and Var3 to a single phrase to use in the loop to search.

data test;
var1='I went to my doctor today';
var2='My blood pressure is high';
var3='I live in Alabama and am likely to stay there';
output;

data want;
set test;
length word $ 25;
searchstr = catx(' ', var1,var2,var3);
state_flag=0; do i=1 to countw(searchstr); word=scan(searchstr,i); if findw("Alabama|Alaska",strip(word),'|','i')>=1 then do; state_flag=1; leave; end; end; run;

You would drop the i, searchstr and likely Word temporary variables though leaving Word in tells you which match was used to set the flag. Using the Searchstr variable likely need to provide a Length value for that as well.

 

View solution in original post

10 REPLIES 10
PaigeMiller
Diamond | Level 26

To tell you the truth, showing us portions of the data step, and not the whole data step, hinders our ability to answer your questions. In addition, not showing us portions of the data (specifically, typical contents of var1, var2 and var3) also hinders our ability to answer your questions. And I don't seem to have sashelp.tgrmaps in SAS OnDemand for Academics.

 

Nevertheless, this may work (depending on your data)

 

proc sql noprint;
  select distinct statecode
  into :statecodes separated by '|'
  from sashelp.zipcode;
quit;
%put &=statecodes;

data have;
    var1='CT';
    var2='BU';
    output;
    var1='BU';
    var2='AG';
    output;
run;
data want;
    set have;
    if findw("&statecodes",var1,'|','i')>=1 then state_flag=1;
	else if findw("&statecodes",var2,'|','i')>=1 then state_flag=1;
	else state_flag=0;
run;

 

If it doesn't work, then provide the information requested above.

--
Paige Miller
martyvd
Fluorite | Level 6

The whole data step is as simple as you wrote it which is why it was not included. Here are some examples of what var1, var2, and var3 might look like. 

 

data test;
var1='I live in Alabama';
var2='My blood pressure is high';
var3='I went to my doctor today';
output;

data want;
set test;

if findw("&statenames",var1,'|','i')>=1 then state_flag=1;
else if findw("&statenames",var2,'|','i')>=1 then state_flag=1;
else if findw("&statenames",var3,'|','i')>=1 then state_flag=1;
else state_flag=0;
run;

The code should flag this record because it contains 'Alabama' but it does not as currently written. 

ballardw
Super User

@martyvd wrote:

The whole data step is as simple as you wrote it which is why it was not included. Here are some examples of what var1, var2, and var3 might look like. 

 

data test;
var1='I live in Alabama';
var2='My blood pressure is high';
var3='I went to my doctor today';
output;

data want;
set test;

if findw("&statenames",var1,'|','i')>=1 then state_flag=1;
else if findw("&statenames",var2,'|','i')>=1 then state_flag=1;
else if findw("&statenames",var3,'|','i')>=1 then state_flag=1;
else state_flag=0;
run;

The code should flag this record because it contains 'Alabama' but it does not as currently written. 


No it should not find "Alabama" as you are searching for the value 'I live in Alabama'.

If you want to find Alabama then you have to parse out each word in the value of Var1 and search for that individually.

Easy enough to demonstrate:

data test;
var1='I live in Alabama';
var2='My blood pressure is high';
var3='I went to my doctor today';
output;

data want;
set test;

if findw("Alabama Alaska",var1,'|','i')>=1 then state_flag=1;
run;

So try something more like:

data test;
var1='I live in Alabama';
var2='My blood pressure is high';
var3='I went to my doctor today';
output;

data want;
set test;
length word $ 25;
do i=1 to countw(var1);
   word=scan(var1,i);
   if findw("Alabama|Alaska",strip(word),'|','i')>=1 then do;
      state_flag=1;
      leave;
   end;
end;
run;

Need to make sure that the temporary variable Word is defined to be long enough to contain likely "words" in your phrases. You will need Strip in the Findw because otherwise the variable Word will be padded to its length with blanks and likely not found in the phrase.

The Leave instruction will terminate the loop over the words in Var1 as soon as the first match is found.

You would want to add two more loops for var2 and var3 OR concatenate Var1, var2 and Var3 to a single phrase to use in the loop to search.

data test;
var1='I went to my doctor today';
var2='My blood pressure is high';
var3='I live in Alabama and am likely to stay there';
output;

data want;
set test;
length word $ 25;
searchstr = catx(' ', var1,var2,var3);
state_flag=0; do i=1 to countw(searchstr); word=scan(searchstr,i); if findw("Alabama|Alaska",strip(word),'|','i')>=1 then do; state_flag=1; leave; end; end; run;

You would drop the i, searchstr and likely Word temporary variables though leaving Word in tells you which match was used to set the flag. Using the Searchstr variable likely need to provide a Length value for that as well.

 

martyvd
Fluorite | Level 6
Thank you, this seems to be working.
Reeza
Super User
Why are you using %SYSFUNC in an IF/ELSE without macro logic? I still don't think FINDW will work but that code is confusing.
Reeza
Super User

If a temporary array approach is an option try this approach instead. Because temporary arrays are loaded into memory this may be faster.

 

https://gist.github.com/statgeek/2f733d27820f43fa37d6ba92c30f22cf

 

 

ballardw
Super User

SASHELP.TGRMAPS seems to be installed only with SAS/GIS, which is sort of deprecated for most users.

 

First thing, have you LOOKED at your Statenames macro variable? You may have many repeats as the description I find for that data set involves FIPS codes and states. So it is quite possible that a state appears many times in the variable.

 

You might try

proc sql noprint;                                                                                                                       
  select Distinct staten
  into :statenames separated by '|'                                                                                                     
  from sashelp.tgrmaps;                                                                                                                            
quit;

to reduce the size of the macro variable and see if that speeds up the solution you have that seems to "work" with Prxmatch.

 

Is this really looking through random text to see if one (or more) state names appears in the text?

You may have to take another pass or two through the data if that is the case as state names may appear in compound words in places that the value is not actually a state. Just a local example: in Boise, Idaho we have a "New York Canal". Is that intended to be a match for your process? Other place names like mountain, rivers and streams, roads, restaurants to start a list have states as part of their names.

martyvd
Fluorite | Level 6

I confirmed that the statenames macro variable contains only the names of 50 states + DC with no duplicates. 

 

Yes, I am really trying to look through text to see if one or more state names appear. My objective is to identify records where individuals may have entered personal information such as their location. I am not concerned about where state names may appear in compound words. Flagging those may actually be helpful. 

Reeza
Super User
data test;
    length var1 var2 var3 $100.;
    var1='I live in Alabama';
    var2='My blood pressure is high';
    var3='I went to my doctor today';
    output;
    var1='I live in Canada';
    var2='My blood pressure is high';
    var3='I went to my doctor today';
    output;
    var1='I live in Canada';
    var2='My blood pressure is high in Montana';
    var3='I went to my doctor today';
    output;
run;

proc sql;
    create table us_states as select statename from sashelp.us_data;
quit;

%let num_search_terms = &sqlobs.;
%put &num_search_terms;

data flagged;
    *declare array;
    array _search(&num_search_terms.) $100. _temporary_;

    /*2*/
    *load array into memory;

    if _n_=1 then
        do j=1 to &num_search_terms.;
            set us_states;
            _search(j)=statename;
        end;
    set test;
    array _var(3) var1-var3;
    *set flag to 0 for initial start;
    state_flag=0;

    /*3*/
    *loop through and craete flag;

    do i=1 to &num_search_terms. while(state_flag=0);

        /*4*/
        do k=1 to 3 while(state_flag=0);

            if find(_var(k), _search(i), 'it')>0 then
                state_flag=1;
        end;
    end;
    drop i j k;
run;
Tom
Super User Tom
Super User

@martyvd wrote:

I confirmed that the statenames macro variable contains only the names of 50 states + DC with no duplicates. 

 

Yes, I am really trying to look through text to see if one or more state names appear. My objective is to identify records where individuals may have entered personal information such as their location. I am not concerned about where state names may appear in compound words. Flagging those may actually be helpful. 


If you want to directly search if ANY of a list of values appears in free text then a regular expression is probably going to be the shortest code.  The pipe character is used in a regular expression to list a set of alternative possible matches.

 

Example:

data have;
  infile cards dsd truncover ;
  input (var1 var2) (:$30.) ;
cards;
I live in Alabama,
,My blood pressure is high
I live in Kansas,My blood pressure is low
;

data want ;
  set have;
  found=0<prxmatch('/\b(alabama|alaska|arkansas)\b/i',catx(' ',of var1-var2));
run;

Result

OBS          var1                     var2               found

 1     I live in Alabama                                   1
 2                          My blood pressure is high      0
 3     I live in Kansas     My blood pressure is low       0

You could have the list of states in a macro variable:

%let states=alabama|alaska|arkansas;
data want ;
  set have;
  found=0<prxmatch("/\b(&states)\b/i",catx(' ',of var1-var2));
run;

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1180 views
  • 0 likes
  • 5 in conversation