BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sas-inquirer
Quartz | Level 8

Hi there,

   I am losing it over what I think is basic straight forward code. I am joining have_1 which is a list of store names with have_2 which is a concordance between store name variations and base store names. When I run the following code I get inconsistent results:

data have_1;
infile datalines dlm = ':' truncover;
input store $CHAR50.;
datalines;
SUPER C SHAWNIGAN
 IGA
IGA
FRESON BROS. IGA
LOBLAW SUPERSTORE
SUPERSTORE LOBLAW
foodlang iga
IGA MARCHE
ROOTS BAYSHORE CENTRE
;
run;

data have_2;
infile datalines dlm = ':' truncover;
input base_store $ store_variation $;
datalines;
SUPER C:SUPER C
ROOTS:ROOTS
IGA:IGA 
LOBLAWS:LOBLAW 
;
run;


proc sql;
create table want as
select a.*, b.base_store
from have_1 a left join have_2 b
on upcase(a.store) like ('%'||b.store_variation||'%');
quit;

Results:

store         base_store

SUPER C SHAWNIGAN     SUPER C

 IGA                                     IGA

IGA                                      IGA

FRESON BROS. IGA          IGA

foodlang IGA                       IGA

SUPERSTORE LOBLAW   LOBLAWS

LOBLAW SUPERSTORE   

IGA MARCHE

ROOTS BAYSHORE CENTRE

 

My question is, why does the code make a match consistently when the store_variation is on its own or at the end of the store value, but not when the store_variation is at the beginning. Also, why do you think it matched at the beginning for SUPER C SHAWNIGAN, but not LOBLAW SUPERSTORE, IGA MARCHE or ROOTS BAYSHORE CENTRE?

 

I'm puzzled and need to figure this out. Please help!

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

@sas-inquirer wrote:
Hi Reeza,
I've tried STRIP with CONTAINS, FIND with TRIM, INDEX and FINDW but I found with each I did not get the desired result. With these, it takes care of not coding SUPER C SHAWNIGAN to IGA, but then neglects to code FRESON BROS. IGA to IGA. Thanks for your suggestion. I could use FIND/FINDW for the single word cases, but I would still need to hardcode if index(STORE," IGA " to map IGA to FRESON BROS. IGA.

If you want to add spaces around the search term and use the INDEX() function then make sure to also add spaces around the string you are searching so it will match at the ends of the string.

index(cat(' ',string,' '),cat(' ',strip(word),' '))

You don't need to add the spaces if you use INDEXW() instead. Another advantage of using INDEXW() is that it will also work for finding IGA in strings like 'BROS.IGA'.  You can use the optional third argument to identify what characters are used to indicate word boundaries.  The default will include punctuation like periods. 

 

indexw(string,strip(word))

 

It might even be easier to use FINDW() instead.  Then you can use its modifiers to have it do case insensitive search and strip the blanks for you.

findw(string,word,,'spit')

 

 

 

View solution in original post

9 REPLIES 9
Reeza
Super User

You have the wildcards at the beginning and end to allow more flexibility but nothing in the middle. If its more of a search, you can use the FIND/INDEX function instead.  Or you can try a COMPGED or some other function to do a fuzzy match. 

 

 

Tom
Super User Tom
Super User

SAS variables are fixed length and padded with spaces.  You are adding a lot of spaces before the trailing % wildcard.

upcase(a.store) like ('%'||b.store_variation||'%');

Since you variable has the default length of $8 when store variation is 'IGA' then contain 'IGA     '.

Don't include the spaces in the value.

upcase(a.store) like cats('%',b.store_variation,'%');
sas-inquirer
Quartz | Level 8
Hi Tom,
I like your solution. It certainly took care of the matching at beginning issue. Now assuming I had put a space before and after IGA in have_2, how could I make sure that SUPER C SHAWNIGAN only has one result SUPER C and not both SUPER C and IGA?
sas-inquirer
Quartz | Level 8
Unless anyone has a more automated way of doing it, I think I'm going to have to hardcode some values. I plan on removing IGA from the have_2 table and hardcoding the store_variation(s) into an if statement using the index function.
Reeza
Super User

For single word matches can you try FIND() or FINDW() instead?

FINDW/INDEXW looks for a full word match, whereas FIND/INDEX will find embedded matches, such as in a word.

sas-inquirer
Quartz | Level 8
Hi Reeza,
I've tried STRIP with CONTAINS, FIND with TRIM, INDEX and FINDW but I found with each I did not get the desired result. With these, it takes care of not coding SUPER C SHAWNIGAN to IGA, but then neglects to code FRESON BROS. IGA to IGA. Thanks for your suggestion. I could use FIND/FINDW for the single word cases, but I would still need to hardcode if index(STORE," IGA " to map IGA to FRESON BROS. IGA.
Reeza
Super User

To me, treating the single word conditions differently would still categorize the IGA one. 

Unfortunately with this type of search each approach has it's pro/cons so you have to pick the best that works, or use multiple approaches.

 

Open Refine is not a bad tool to do this work either.

Tom
Super User Tom
Super User

@sas-inquirer wrote:
Hi Reeza,
I've tried STRIP with CONTAINS, FIND with TRIM, INDEX and FINDW but I found with each I did not get the desired result. With these, it takes care of not coding SUPER C SHAWNIGAN to IGA, but then neglects to code FRESON BROS. IGA to IGA. Thanks for your suggestion. I could use FIND/FINDW for the single word cases, but I would still need to hardcode if index(STORE," IGA " to map IGA to FRESON BROS. IGA.

If you want to add spaces around the search term and use the INDEX() function then make sure to also add spaces around the string you are searching so it will match at the ends of the string.

index(cat(' ',string,' '),cat(' ',strip(word),' '))

You don't need to add the spaces if you use INDEXW() instead. Another advantage of using INDEXW() is that it will also work for finding IGA in strings like 'BROS.IGA'.  You can use the optional third argument to identify what characters are used to indicate word boundaries.  The default will include punctuation like periods. 

 

indexw(string,strip(word))

 

It might even be easier to use FINDW() instead.  Then you can use its modifiers to have it do case insensitive search and strip the blanks for you.

findw(string,word,,'spit')

 

 

 

sas-inquirer
Quartz | Level 8
Hi Tom,
Fantastic solutions! I am going with indexw. It worked like a charm. Thank you so much!!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 883 views
  • 5 likes
  • 3 in conversation