Hi there,
I am losing it over what I think is basic straight forward code. I am joining have_1 which is a list of store names with have_2 which is a concordance between store name variations and base store names. When I run the following code I get inconsistent results:
data have_1;
infile datalines dlm = ':' truncover;
input store $CHAR50.;
datalines;
SUPER C SHAWNIGAN
IGA
IGA
FRESON BROS. IGA
LOBLAW SUPERSTORE
SUPERSTORE LOBLAW
foodlang iga
IGA MARCHE
ROOTS BAYSHORE CENTRE
;
run;
data have_2;
infile datalines dlm = ':' truncover;
input base_store $ store_variation $;
datalines;
SUPER C:SUPER C
ROOTS:ROOTS
IGA:IGA
LOBLAWS:LOBLAW
;
run;
proc sql;
create table want as
select a.*, b.base_store
from have_1 a left join have_2 b
on upcase(a.store) like ('%'||b.store_variation||'%');
quit;
Results:
store base_store
SUPER C SHAWNIGAN SUPER C
IGA IGA
IGA IGA
FRESON BROS. IGA IGA
foodlang IGA IGA
SUPERSTORE LOBLAW LOBLAWS
LOBLAW SUPERSTORE
IGA MARCHE
ROOTS BAYSHORE CENTRE
My question is, why does the code make a match consistently when the store_variation is on its own or at the end of the store value, but not when the store_variation is at the beginning. Also, why do you think it matched at the beginning for SUPER C SHAWNIGAN, but not LOBLAW SUPERSTORE, IGA MARCHE or ROOTS BAYSHORE CENTRE?
I'm puzzled and need to figure this out. Please help!
@sas-inquirer wrote:
Hi Reeza,
I've tried STRIP with CONTAINS, FIND with TRIM, INDEX and FINDW but I found with each I did not get the desired result. With these, it takes care of not coding SUPER C SHAWNIGAN to IGA, but then neglects to code FRESON BROS. IGA to IGA. Thanks for your suggestion. I could use FIND/FINDW for the single word cases, but I would still need to hardcode if index(STORE," IGA " to map IGA to FRESON BROS. IGA.
If you want to add spaces around the search term and use the INDEX() function then make sure to also add spaces around the string you are searching so it will match at the ends of the string.
index(cat(' ',string,' '),cat(' ',strip(word),' '))
You don't need to add the spaces if you use INDEXW() instead. Another advantage of using INDEXW() is that it will also work for finding IGA in strings like 'BROS.IGA'. You can use the optional third argument to identify what characters are used to indicate word boundaries. The default will include punctuation like periods.
indexw(string,strip(word))
It might even be easier to use FINDW() instead. Then you can use its modifiers to have it do case insensitive search and strip the blanks for you.
findw(string,word,,'spit')
You have the wildcards at the beginning and end to allow more flexibility but nothing in the middle. If its more of a search, you can use the FIND/INDEX function instead. Or you can try a COMPGED or some other function to do a fuzzy match.
SAS variables are fixed length and padded with spaces. You are adding a lot of spaces before the trailing % wildcard.
upcase(a.store) like ('%'||b.store_variation||'%');
Since you variable has the default length of $8 when store variation is 'IGA' then contain 'IGA '.
Don't include the spaces in the value.
upcase(a.store) like cats('%',b.store_variation,'%');
For single word matches can you try FIND() or FINDW() instead?
FINDW/INDEXW looks for a full word match, whereas FIND/INDEX will find embedded matches, such as in a word.
To me, treating the single word conditions differently would still categorize the IGA one.
Unfortunately with this type of search each approach has it's pro/cons so you have to pick the best that works, or use multiple approaches.
Open Refine is not a bad tool to do this work either.
@sas-inquirer wrote:
Hi Reeza,
I've tried STRIP with CONTAINS, FIND with TRIM, INDEX and FINDW but I found with each I did not get the desired result. With these, it takes care of not coding SUPER C SHAWNIGAN to IGA, but then neglects to code FRESON BROS. IGA to IGA. Thanks for your suggestion. I could use FIND/FINDW for the single word cases, but I would still need to hardcode if index(STORE," IGA " to map IGA to FRESON BROS. IGA.
If you want to add spaces around the search term and use the INDEX() function then make sure to also add spaces around the string you are searching so it will match at the ends of the string.
index(cat(' ',string,' '),cat(' ',strip(word),' '))
You don't need to add the spaces if you use INDEXW() instead. Another advantage of using INDEXW() is that it will also work for finding IGA in strings like 'BROS.IGA'. You can use the optional third argument to identify what characters are used to indicate word boundaries. The default will include punctuation like periods.
indexw(string,strip(word))
It might even be easier to use FINDW() instead. Then you can use its modifiers to have it do case insensitive search and strip the blanks for you.
findw(string,word,,'spit')
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.