Proc Summary problem

esuhco · Posted 10-26-2018 10:28 AM

Hello all,

I have a proc summary problem. I have a large transactions dataset (c.10m observations). The dataset has merchant names (e.g. SHELL, NETFLIX, FACEBOOK etc). When I run a proc summary of merchant names, it outputs c.280k types of merchant names. But a lot of them are similar, see below list.

Is there a smart/"AI-ish" way for me to go through the PROC_SUMMARY output and say that SHELL EASINGTON EAST and SHELL BIRCHWOOD are all just "SHELL"?

Currently (on another question page) I am trying to do this by using PRXMatching. But this is super long and tedious and I have run then rerun to see what the next merchant name I should add to the PRXM find and switch code.

Is there a way in PROC SUMMARY or other SAS procedure that I can give it the proc summary output of 280k observations (of merchant type names) and it could tell me that some names can be merged into one, e.g. like the above SHELL example?

I also received a very useful reply on my other question, that I could use fuzzy merging instead of PRXM. But having looked into it (and maybe I'm wrong here), but that would mean I have to create a COMPGED and SPEDIS score variable and SCORE_flag variable for each different merchant name type, and this would also take a long time. No?

FYI - my earlier/other question for the PRXM method is here (https://communities.sas.com/t5/SAS-Data-Mining-and-Machine/Search-for-word-or-string-in-a-char-varia... ), if you need more history. I felt like this question on a smarter proc summary deserved a different question, as it's slightly different ask.

SCARPETTA 1 32
SCOTMID COOP PENRITH 0 1 32
SECC ARENA 1 32
SEFTON ARMS 1 32
SELCO T C - OLD KENT R 1 32
SHELL APSLEY 1 32
SHELL BIRCHWOOD 1 32
SHELL CHESHAM 1 32
SHELL CLARENCE RD 1 32
SHELL CRUMPSALL SERVIC 1 32
SHELL EASINGTON EAST 1 32
SHELL HOP OAST 112 1 32
SHELL LINCOLN 1 32
SHELL LYME GREEN 1 32
SHELL OAKLANDS PK 1 32
SHELL OAKLANDSPARK 1 32
SHELL RISE PARK 30 1 32
SHELL ROSYTH 1 32
SHELL SHINFIELD 1 32
SHELL SHIRLEY 1 32
SHELL STPHILLIPS 1 32
SHELL TELFORD 1 32

PaigeMiller · Posted 10-26-2018 10:33 AM

This isn't really a PROC SUMMARY problem. PROC SUMMARY did exactly what it was supposed to do.

It is a data problem.

You (or someone) has to define the rules that will be used to combine the merchant names. There really isn't a canned way of doing this.

--
Paige Miller

esuhco · Posted 10-26-2018 11:26 AM

Hi Peige, you're absolutely right - it's not really a proc summary issue. More of a data summarisation and usage kind of question. A lot of fuzzy logic has been mentioned so I will try to read into this.

Thank you for your help.

Kind regards,

Esuhco

Reeza · Posted 10-26-2018 11:44 AM

Google: Link King & Mayo or CDC have a fuzzy matching for people algorithms and Statistics Canada has G-Link which is also used for people.

Reeza · Posted 10-26-2018 11:47 AM

Use an iterative approach, how many match identically. Remove those. How many match close with fuzzy logic, then remove those and keep going until you'r done. For SHELL, use SCAN() to get the first word and check on that part.

See this post and FriedEgg's solution which is iterative.

https://communities.sas.com/t5/SAS-Procedures/Name-matching/td-p/82780

Reeza · Posted 10-26-2018 10:52 AM

Fuzzy matching is really, really hard. There is another tool, called DataFlux or DataManagement Studio that does a relatively decent jobs, but it's still off. You really need a company identifier instead of relying on company names. Another option I would recommend is OpenRefine which is a point and click tool, but generates some (I think python) code behind the scenes.

ballardw · Posted 10-26-2018 11:08 AM

@Reeza wrote:
Fuzzy matching is really, really hard. There is another tool, called DataFlux or DataManagement Studio that does a relatively decent jobs, but it's still off. You really need a company identifier instead of relying on company names. Another option I would recommend is OpenRefine which is a point and click tool, but generates some (I think python) code behind the scenes.

@Reeza brings up a very significant point about names if you are dealing with any period of time. Many corporations change names, merge and split. You business logic needs to determine what is done as these activities happen and names are only part of it.

And one project I worked where the data collection staff had instructions on what to enter for company names still managed to have 18 ways to spell "IBM" (my 'favorite' was "I>B>M>")

esuhco · Posted 10-26-2018 11:38 AM

@Reeza I will look into the functions and systems that you have mentioned. Thank you for your reply.

@ballardw you are absolutely right, the naming is an ongoing battle.

To give a bit of history to this question - I am currently part of a Kaggle-style data competition. We've been given a year's worth of transactions data and the challenge is to identify new/unique customer characteristics that can be integrated into the current customer models. So in the spirit of creating something new and unique, I didn't want to copy the processes that our modelling teams currently use for data optimisation and cleaning. I wanted to take the dataset that they had given us, with the assumption that enough data optimization was done by the modelling team, and add "value" to it by looking into it differently.

Regarding the names of merchants, I REALLYYYY don't want to manually input the names into a fuzzy logic/PRXMatch type scenario because names will change or new companies will be set up and this will mean that every 6 months or so we will need to go back into the code and figure out which names are no longer valid and which names need to be added in. So my thought was to create some kind of "smart" SAS code that could break down the names and figure it out for itself that SHELL WOKING and SHELL LONDON are basically just SHELL. I imagine that the merchant name SHELL WOKING is because it was set up on the merchant end and they know they are SHELL, so they want to differentiate by location of the branch. For us, as a user of this data, I know that the branch is in Woking because I have another variable that gives me the location of the transaction. I also wanted to use another variable called something like MCC merchant categorisation, but I found out that, for example, AMAZON shares this code with another company called CREDITEXPERT (or something) - so that's a no-go for me.

Apologies for waffling. It is a very interesting challenge and has opened my eyes to the difficulties of being in the back office trying to figure out what is what and give meaningful analysis and directions to my senior managers - from a modelling perspective.

Have a good weekend guys and I will be looking into your suggestions and also feeding back my progress 🙂

Kind regards,

Esuhco

Proc Summary problem

Re: Proc Summary problem

Re: Proc Summary problem

Re: Proc Summary problem

Re: Proc Summary problem

Re: Proc Summary problem

Re: Proc Summary problem

Re: Proc Summary problem

SAS Innovate 2025: Save the Date