Search for word or string in a char variable (trying to use PRXMATCH)

esuhco · Posted 10-26-2018 08:17 AM

Hello All,

I have a year's worth of customer transactions (c.11m observations). This dataset shows where customers have spent/transferred their money and the goal of the analysis is to identify behaviours that can be used for customer scoring models. However, the problem I have now is that the merchant names (companies where customers spent their money) vary greatly. After doing a proc summary, there are c.290k types of merchant names. But a lot of them are the same thing, just spelt differently, e.g. Sainsburys vs. JSainsburys vs. J Sainsburys vs. www.sainsburys.com etc.

I wanted to do some kind of filter where the SAS program could recognise similar names and make a separate variable with a corrected name. I used the FIND and FINDW functions but it just did not want to work (most likely I made a mistake somewhere). Reading this forum I also found the PRXMATCH function, and have tried it below and there has been some success.

Problem is, in the output file I can see that the PRXM is doing something. It correctly identifies merchants that are not in the below IF statements and gives them a 0 PRXM flag. However, the merchants that are listed below, their PRXM flag and new merchant names are just blank. So it's doing something, and on the right track, but now what I want.

I read somewhere that the FIND function can only work on a single word/string in a SAS data step, so the below code maybe wouldn't work if I replaced PRXM with FIND. But I'm having a hard time finding the appropriate literature for what I want. in most cases the user just wants to replace a single string in a variable with a different string OR filter out using a WHERE CONTAINS kind of function. I don't want to do that as I am trying to clean up the data rather than cut-up/delete observations.

Any help would be greatly appreciated.

data clean4;

set clean3;

length MRCHNT_NM_V4 $30;

if prxmatch("m/MCDONALD/oi",MRCHNT_NM_V3) >0 then MRCHNT_NM_V4="MCDONALDS" and prxm_FG=1;

else if prxmatch("m/ITUNES/oi",MRCHNT_NM_V3) >0 then MRCHNT_NM_V4="ITUNES" and prxm_FG=1;

else if prxmatch("m/ASDA/oi",MRCHNT_NM_V3) > 0 then MRCHNT_NM_V4="ASDA" and prxm_FG=1;

else if prxmatch("m/SAINSBURY/oi",MRCHNT_NM_V3) > 0 then MRCHNT_NM_V4="SAINSBURYS" and prxm_FG=1;

else if prxmatch("m/LIDL/oi",MRCHNT_NM_V3) > 0 then MRCHNT_NM_V4="LIDL" and prxm_FG=1;

else if prxmatch("m/ALDI/oi",MRCHNT_NM_V3) > 0 then MRCHNT_NM_V4="ALDI" and prxm_FG=1;

else if prxmatch("m/AMAZON/oi",MRCHNT_NM_V3) > 0 then MRCHNT_NM_V4="AMAZON" and prxm_FG=1;

else if prxmatch("m/MORRISON/oi",MRCHNT_NM_V3) > 0 then MRCHNT_NM_V4="MORRISONS" and prxm_FG=1;

else if prxmatch("m/NETFLIX/oi",MRCHNT_NM_V3) > 0 then MRCHNT_NM_V4="NETFLIX" and prxm_FG=1;

else if prxmatch("m/FACEBOOK/oi",MRCHNT_NM_V3) > 0 then MRCHNT_NM_V4="FACEBOOK" and prxm_FG=1;

else if prxmatch("m/PAYPAL/oi",MRCHNT_NM_V3) > 0 then MRCHNT_NM_V4="PAYPAL" and prxm_FG=1;

else if prxmatch("m/SPOTIFY/oi",MRCHNT_NM_V3) > 0 then MRCHNT_NM_V4="SPOTIFY" and prxm_FG=1;

else if prxmatch("m/SPAR/oi",MRCHNT_NM_V3) > 0 then MRCHNT_NM_V4="SPAR" and prxm_FG=1;

else prxm_FG=0;

run;

VDD · Posted 10-26-2018 08:31 AM

can you set 2 different values based on an if statement without using a do loop?

@esuhcohas: if prxmatch("m/MCDONALD/oi",MRCHNT_NM_V3) >0 then MRCHNT_NM_V4="MCDONALDS" and prxm_FG=1;

should it be written as:

if prxmatch("m/MCDONALD/oi",MRCHNT_NM_V3) >0 then

do;

MRCHNT_NM_V4="MCDONALDS" ;

prxm_FG=1;

end;

esuhco · Posted 10-26-2018 09:13 AM

Hi VDD, thank you for the quick reply. I think the method you proposed worked (though I don't understand why my IF-THEN-ELSE didn't work, but the IF-THEN-DO did). I had to research how to do a IF-THEN-DO statement as I never use them at work.

My follow up question would be (if you have some time) is: how can I put in a ELSE if non of the PRXMs match to those special names, then just copy the old name into the new name. Apologies if this is a silly question, but I have very limited knowledge of DO functions.

This is my updated code.

data clean4b;

set clean3;

length MRCHNT_NM_V4 $30;

if prxmatch("m/MCDONALDS/oi",MRCHNT_NM_V3) >0 then do;

MRCHNT_NM_V4="MCDONALDS";

prxm_FG=1;

end;

if prxmatch("m/SAINSBURY/oi",MRCHNT_NM_V3) >0 then do;

MRCHNT_NM_V4="SAINSBURYS";

prxm_FG=1;

end;

if prxmatch("m/MORRISON/oi",MRCHNT_NM_V3) >0 then do;

MRCHNT_NM_V4="MORRISON";

prxm_FG=1;

end;

if prxmatch("m/NETFLIX/oi",MRCHNT_NM_V3) >0 then do;

MRCHNT_NM_V4="NETFLIX";

prxm_FG=1;

end;

if prxmatch("m/SPOTIFY/oi",MRCHNT_NM_V3) >0 then do;

MRCHNT_NM_V4="SPOTIFY";

prxm_FG=1;

end;

if prxmatch("m/SPAR/oi",MRCHNT_NM_V3) >0 then do;

MRCHNT_NM_V4="SPAR";

prxm_FG=1;

end;

run;

VDD · Posted 10-26-2018 09:19 AM

data clean4b;

set clean3;

length MRCHNT_NM_V4 $30;

if prxmatch("m/MCDONALDS/oi",MRCHNT_NM_V3) >0 then do;

MRCHNT_NM_V4="MCDONALDS";

prxm_FG=1;

end;

else if prxmatch("m/SAINSBURY/oi",MRCHNT_NM_V3) >0 then do;

MRCHNT_NM_V4="SAINSBURYS";

prxm_FG=1;

end;

run;

esuhco · Posted 10-26-2018 09:57 AM

Hi VDD,

I had mixed success with this one. I tried ELSE DO, but SAS didn't like that saying I wasn't referring to an actual IF portion. So I bypassed this, maybe amateurishly, by creating a second DATA step. I think this makes sense, because the ELSE statement probably couldn't fit into the main PRXM program, because it would then override previous ELSE IF PRXM codes. What do you think?

Also, fyi when I refer to ELSE statements I mean IF x=a THEN y=b, ELSE IF x=c then y=d; ELSE y=e;. It's the last ELSE y=e bit that I can't seem to recreate in the IF-THEN-DO system.

data clean4;

set clean3;

length MRCHNT_NM_V4 $30;

if prxmatch("m/MCDONALDS/oi",MRCHNT_NM_V3) >0 then do;

MRCHNT_NM_V4="MCDONALDS";

prxm_FG=1;

end;

else if prxmatch("m/SAINSBURY/oi",MRCHNT_NM_V3) >0 then do;

MRCHNT_NM_V4="SAINSBURYS";

prxm_FG=1;

end;

else if prxmatch("m/MORRISON/oi",MRCHNT_NM_V3) >0 then do;

MRCHNT_NM_V4="MORRISON";

prxm_FG=1;

end;

else if prxmatch("m/NETFLIX/oi",MRCHNT_NM_V3) >0 then do;

MRCHNT_NM_V4="NETFLIX";

prxm_FG=1;

end;

run;

data clean5;

set clean4;

if prxm_FG ne 1 then MRCHNT_NM_V4=MRCHNT_NM_V3;

run;

SuryaKiran · Posted 10-26-2018 08:59 AM

I would say if you use one single function that may not help you. I suggest use multiple function to identify your data pattern or fuzzy matching on names. COMPGED() and SPEDIS() are great function generally used for finding fuzzy matches. You may need to score your pattern matching and then create a new variable based on the score.

Example:

data have;
input name $50.;
datalines;
Sainsburys
JSainsburys
J Sainsburys
www.sainsburys.com
www.amazon.com
;
run;

proc sql;
select 	a.name,
		COMPGED(a.name,"Sainsburys") as Score_COMPGED,
		SPEDIS(a.name,"Sainsburys") as Score_SPEDIS,
		SPEDIS("Sainsburys",a.name) as Score_SPEDIS_rev,
		case when CALCULATED Score_COMPGED<=1000 and
				  CALCULATED Score_SPEDIS<=100 and
				  CALCULATED Score_SPEDIS_rev<=100
			THEN 'Sainsburys' ELSE a.name end as new_name
from have as a;
quit;

Check this blog on fuzzy name matching.

Thanks,
Suryakiran

esuhco · Posted 10-26-2018 09:18 AM

Hi Suryakiran,

Thank you for your reply, it's very useful. I did try at one point to do the fuzzy matching but the literature online was just not made for someone of my low SAS level.

Your example below will be super useful to me because I have used PRXM matching just now and VDD's reply on this forum helped me make it work. But in some instances, the find/PRXM function is not great. For example, after I made the PRXM function work with VDD's help, I noticed that any merchant with "spar" string in it became "SPAR" the food/grocery. This is not correct. So I will most likely need to use the fuzzy matching or even SCAN for the SPAR example.

I think PRXM will be very useful for companies that have quite unique names (that can't be parts of other company names), e.g. NETFLIX, SPOTIFY, FACEBOOK, GOOGLE, SAINSBURY (to an extend).

I will give this a try and post back my results when I've done this.

Kind regards,

esuhco

esuhco · Posted 10-26-2018 10:07 AM

Hi Suryakiran,

I just tried to create a fuzzy logic code based on what you've said and ran into a problem almost straight away.

My original list of transactions (without any previous filtering e.g. PRXM etc) has c.290k proc summary observation types. Based on what you wrote below, does that mean I have to create a large amount of Score_COMPGED, Score_SPEDIS and Score_SPEDIS_rev variables for each observation comparable?

What I mean by this in a simplified way - say that I know that I have specifically Sainsburys, Netflix, Facebook, Paypal, Spar and Shell . observations with different spelled and typed names ( e.g. SHELL WHARTONS, SHELL WINDMILL, SHELL etc). Would I have to create a Score_C, Score_S and Score_Sr variables for each type of fuzzy logic name that I have? Because that would mean I have to create a massive "multiplication" type table for this scenario, no?

Or am I missing something here.

To just check for Sainsbury's looking at the below code is fairly straightforward (I'm guessing). But if I wanted to to do that for loads of other names, then my table would be growing by 3 extra variable * # of fuzzy names?

What do you think? Apologies for the convoluted question. it's very difficult to ask google these long questions.

Kind regards,

esuhco

SuryaKiran · Posted 10-26-2018 10:43 AM

Score columns are only to understand the scoring and to decide the limit to set and not necessarily required in your final output once you know what value to set to get some meaningful results.

You can create a single required column using CASE WHEN expression for multiple conditions.

proc sql;
select 	a.name,
		case when COMPGED(upcase(a.name),"SAINSBURYS")<=1000 and
				  SPEDIS(upcase(a.name),"SAINSBURYS")<=100 and
				  SPEDIS("SAINSBURYS",a.name)<=100 THEN 'SAINSBURYS' 
			 when COMPGED(upcase(a.name),"NETFLIX")<=1000 and
				  SPEDIS(upcase(a.name),"NETFLIX")<=100 and
				  SPEDIS("NETFLIX",a.name)<=100 THEN 'NETFLIX'
			when COMPGED(upcase(a.name),"FACEBOOK")<=1000 and
				  SPEDIS(upcase(a.name),"FACEBOOK")<=100 and
				  SPEDIS("FACEBOOK",a.name)<=100 THEN 'FACEBOOK'
			else a.name end as new_name
from have as a;
quit;

Thanks,
Suryakiran

esuhco · Posted 10-26-2018 11:23 AM

Hi Suryakiran,

Oh wow, this looks like exactly what I need - the CASE function example. Awesome, thank you so much for your quick reply. I'm gonna try this over the weekend, as i'll be heading out soon. But will definitely keep you posted!!!!!!

Thanks again for your help and enjoy your weekend!

Kind regards,

esuhco

Search for word or string in a char variable (trying to use PRXMATCH)

Re: Search for word or string in a char variable (trying to use PRXMATCH)

Re: Search for word or string in a char variable (trying to use PRXMATCH)

Re: Search for word or string in a char variable (trying to use PRXMATCH)

Re: Search for word or string in a char variable (trying to use PRXMATCH)

Re: Search for word or string in a char variable (trying to use PRXMATCH)

Re: Search for word or string in a char variable (trying to use PRXMATCH)

Re: Search for word or string in a char variable (trying to use PRXMATCH)

Re: Search for word or string in a char variable (trying to use PRXMATCH)

Re: Search for word or string in a char variable (trying to use PRXMATCH)

Registration is open