I am trying to flag certain words in 3 different string variables using a macro variable list. For example, I created a macro variable containing a list of US state names.
proc sql noprint;
select staten
into :statenames separated by '|'
from sashelp.tgrmaps;
quit;
First I tried to use the findw function but it did not flag any records. The code works with a single word as the argument so I suspect it may be searching for the entire list rather than the individual names. I wasn't able to work out how to fix that.
if %sysfunc(findw(var1,&statenames, ,i))>=1 then state_flag=1;
else if %sysfunc(findw(var2,&statenames, ,i))>=1 then state_flag=1;
else if %sysfunc(findw(var3,&statenames, ,i))>=1 then state_flag=1;
else state_flag=0;
I then tried to use pattern matching as shown below. However my dataset is very large (~7 million) records and it seems this method is too slow and does not run even after 2+ hours.
if prxmatch("/&statenames/i",var1)>0 then state_flag=1;
else if prxmatch("/&statenames/i",var2)>0 then state_flag=1;
else if prxmatch("/&statenames/i",var3)>0 then state_flag=1;
else state_flag=0;
@martyvd wrote:
The whole data step is as simple as you wrote it which is why it was not included. Here are some examples of what var1, var2, and var3 might look like.
data test; var1='I live in Alabama'; var2='My blood pressure is high'; var3='I went to my doctor today'; output; data want; set test; if findw("&statenames",var1,'|','i')>=1 then state_flag=1; else if findw("&statenames",var2,'|','i')>=1 then state_flag=1; else if findw("&statenames",var3,'|','i')>=1 then state_flag=1; else state_flag=0; run;
The code should flag this record because it contains 'Alabama' but it does not as currently written.
No it should not find "Alabama" as you are searching for the value 'I live in Alabama'.
If you want to find Alabama then you have to parse out each word in the value of Var1 and search for that individually.
Easy enough to demonstrate:
data test; var1='I live in Alabama'; var2='My blood pressure is high'; var3='I went to my doctor today'; output; data want; set test; if findw("Alabama Alaska",var1,'|','i')>=1 then state_flag=1; run;
So try something more like:
data test; var1='I live in Alabama'; var2='My blood pressure is high'; var3='I went to my doctor today'; output; data want; set test; length word $ 25; do i=1 to countw(var1); word=scan(var1,i); if findw("Alabama|Alaska",strip(word),'|','i')>=1 then do; state_flag=1; leave; end; end; run;
Need to make sure that the temporary variable Word is defined to be long enough to contain likely "words" in your phrases. You will need Strip in the Findw because otherwise the variable Word will be padded to its length with blanks and likely not found in the phrase.
The Leave instruction will terminate the loop over the words in Var1 as soon as the first match is found.
You would want to add two more loops for var2 and var3 OR concatenate Var1, var2 and Var3 to a single phrase to use in the loop to search.
data test; var1='I went to my doctor today'; var2='My blood pressure is high'; var3='I live in Alabama and am likely to stay there'; output; data want; set test; length word $ 25; searchstr = catx(' ', var1,var2,var3);
state_flag=0; do i=1 to countw(searchstr); word=scan(searchstr,i); if findw("Alabama|Alaska",strip(word),'|','i')>=1 then do; state_flag=1; leave; end; end; run;
You would drop the i, searchstr and likely Word temporary variables though leaving Word in tells you which match was used to set the flag. Using the Searchstr variable likely need to provide a Length value for that as well.
To tell you the truth, showing us portions of the data step, and not the whole data step, hinders our ability to answer your questions. In addition, not showing us portions of the data (specifically, typical contents of var1, var2 and var3) also hinders our ability to answer your questions. And I don't seem to have sashelp.tgrmaps in SAS OnDemand for Academics.
Nevertheless, this may work (depending on your data)
proc sql noprint;
select distinct statecode
into :statecodes separated by '|'
from sashelp.zipcode;
quit;
%put &=statecodes;
data have;
var1='CT';
var2='BU';
output;
var1='BU';
var2='AG';
output;
run;
data want;
set have;
if findw("&statecodes",var1,'|','i')>=1 then state_flag=1;
else if findw("&statecodes",var2,'|','i')>=1 then state_flag=1;
else state_flag=0;
run;
If it doesn't work, then provide the information requested above.
The whole data step is as simple as you wrote it which is why it was not included. Here are some examples of what var1, var2, and var3 might look like.
data test;
var1='I live in Alabama';
var2='My blood pressure is high';
var3='I went to my doctor today';
output;
data want;
set test;
if findw("&statenames",var1,'|','i')>=1 then state_flag=1;
else if findw("&statenames",var2,'|','i')>=1 then state_flag=1;
else if findw("&statenames",var3,'|','i')>=1 then state_flag=1;
else state_flag=0;
run;
The code should flag this record because it contains 'Alabama' but it does not as currently written.
@martyvd wrote:
The whole data step is as simple as you wrote it which is why it was not included. Here are some examples of what var1, var2, and var3 might look like.
data test; var1='I live in Alabama'; var2='My blood pressure is high'; var3='I went to my doctor today'; output; data want; set test; if findw("&statenames",var1,'|','i')>=1 then state_flag=1; else if findw("&statenames",var2,'|','i')>=1 then state_flag=1; else if findw("&statenames",var3,'|','i')>=1 then state_flag=1; else state_flag=0; run;
The code should flag this record because it contains 'Alabama' but it does not as currently written.
No it should not find "Alabama" as you are searching for the value 'I live in Alabama'.
If you want to find Alabama then you have to parse out each word in the value of Var1 and search for that individually.
Easy enough to demonstrate:
data test; var1='I live in Alabama'; var2='My blood pressure is high'; var3='I went to my doctor today'; output; data want; set test; if findw("Alabama Alaska",var1,'|','i')>=1 then state_flag=1; run;
So try something more like:
data test; var1='I live in Alabama'; var2='My blood pressure is high'; var3='I went to my doctor today'; output; data want; set test; length word $ 25; do i=1 to countw(var1); word=scan(var1,i); if findw("Alabama|Alaska",strip(word),'|','i')>=1 then do; state_flag=1; leave; end; end; run;
Need to make sure that the temporary variable Word is defined to be long enough to contain likely "words" in your phrases. You will need Strip in the Findw because otherwise the variable Word will be padded to its length with blanks and likely not found in the phrase.
The Leave instruction will terminate the loop over the words in Var1 as soon as the first match is found.
You would want to add two more loops for var2 and var3 OR concatenate Var1, var2 and Var3 to a single phrase to use in the loop to search.
data test; var1='I went to my doctor today'; var2='My blood pressure is high'; var3='I live in Alabama and am likely to stay there'; output; data want; set test; length word $ 25; searchstr = catx(' ', var1,var2,var3);
state_flag=0; do i=1 to countw(searchstr); word=scan(searchstr,i); if findw("Alabama|Alaska",strip(word),'|','i')>=1 then do; state_flag=1; leave; end; end; run;
You would drop the i, searchstr and likely Word temporary variables though leaving Word in tells you which match was used to set the flag. Using the Searchstr variable likely need to provide a Length value for that as well.
If a temporary array approach is an option try this approach instead. Because temporary arrays are loaded into memory this may be faster.
https://gist.github.com/statgeek/2f733d27820f43fa37d6ba92c30f22cf
SASHELP.TGRMAPS seems to be installed only with SAS/GIS, which is sort of deprecated for most users.
First thing, have you LOOKED at your Statenames macro variable? You may have many repeats as the description I find for that data set involves FIPS codes and states. So it is quite possible that a state appears many times in the variable.
You might try
proc sql noprint; select Distinct staten into :statenames separated by '|' from sashelp.tgrmaps; quit;
to reduce the size of the macro variable and see if that speeds up the solution you have that seems to "work" with Prxmatch.
Is this really looking through random text to see if one (or more) state names appears in the text?
You may have to take another pass or two through the data if that is the case as state names may appear in compound words in places that the value is not actually a state. Just a local example: in Boise, Idaho we have a "New York Canal". Is that intended to be a match for your process? Other place names like mountain, rivers and streams, roads, restaurants to start a list have states as part of their names.
I confirmed that the statenames macro variable contains only the names of 50 states + DC with no duplicates.
Yes, I am really trying to look through text to see if one or more state names appear. My objective is to identify records where individuals may have entered personal information such as their location. I am not concerned about where state names may appear in compound words. Flagging those may actually be helpful.
data test;
length var1 var2 var3 $100.;
var1='I live in Alabama';
var2='My blood pressure is high';
var3='I went to my doctor today';
output;
var1='I live in Canada';
var2='My blood pressure is high';
var3='I went to my doctor today';
output;
var1='I live in Canada';
var2='My blood pressure is high in Montana';
var3='I went to my doctor today';
output;
run;
proc sql;
create table us_states as select statename from sashelp.us_data;
quit;
%let num_search_terms = &sqlobs.;
%put &num_search_terms;
data flagged;
*declare array;
array _search(&num_search_terms.) $100. _temporary_;
/*2*/
*load array into memory;
if _n_=1 then
do j=1 to &num_search_terms.;
set us_states;
_search(j)=statename;
end;
set test;
array _var(3) var1-var3;
*set flag to 0 for initial start;
state_flag=0;
/*3*/
*loop through and craete flag;
do i=1 to &num_search_terms. while(state_flag=0);
/*4*/
do k=1 to 3 while(state_flag=0);
if find(_var(k), _search(i), 'it')>0 then
state_flag=1;
end;
end;
drop i j k;
run;
@martyvd wrote:
I confirmed that the statenames macro variable contains only the names of 50 states + DC with no duplicates.
Yes, I am really trying to look through text to see if one or more state names appear. My objective is to identify records where individuals may have entered personal information such as their location. I am not concerned about where state names may appear in compound words. Flagging those may actually be helpful.
If you want to directly search if ANY of a list of values appears in free text then a regular expression is probably going to be the shortest code. The pipe character is used in a regular expression to list a set of alternative possible matches.
Example:
data have;
infile cards dsd truncover ;
input (var1 var2) (:$30.) ;
cards;
I live in Alabama,
,My blood pressure is high
I live in Kansas,My blood pressure is low
;
data want ;
set have;
found=0<prxmatch('/\b(alabama|alaska|arkansas)\b/i',catx(' ',of var1-var2));
run;
Result
OBS var1 var2 found 1 I live in Alabama 1 2 My blood pressure is high 0 3 I live in Kansas My blood pressure is low 0
You could have the list of states in a macro variable:
%let states=alabama|alaska|arkansas;
data want ;
set have;
found=0<prxmatch("/\b(&states)\b/i",catx(' ',of var1-var2));
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.