
Matching and Collapsing Line Listed Data

Occasional Contributor
Posts: 14


Hi All,

 

I am trying to take line listed data with people's names and birthdays (that may or may not be entered correctly) and essentially create a new value so I can match them together and identify them as one individual.

 

I have attached this SAS data set as an example. Here I have false names and birthdates where some match exactly, some are misspelled and some are not even the same people. I have set the scenario as orders for a company, so there is an order number and the customer number is blank. I want to populate the customernumber variable so that I can identify orders that came from the same individual, even though their information may have been typed incorrectly, or not at all, by a clerk for each order.
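Once some rule has decided which pairs of orders belong to the same person, filling in customernumber amounts to grouping orders that are linked directly or through a chain of matches. The following is a minimal, language-neutral sketch in Python (not SAS). The function name `assign_customer_numbers` and the example order numbers are hypothetical, and the matched pairs are assumed to come from whatever matching step is chosen.

```python
def assign_customer_numbers(order_ids, matched_pairs):
    """Give every order a customer number; linked orders share one number.

    order_ids     -- iterable of order numbers (hypothetical example data)
    matched_pairs -- pairs of order numbers already judged to be the same person
    """
    order_ids = list(order_ids)
    parent = {oid: oid for oid in order_ids}  # union-find forest

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    numbers = {}  # group root -> customer number
    result = {}   # order number -> customer number
    for oid in order_ids:
        root = find(oid)
        if root not in numbers:
            numbers[root] = len(numbers) + 1
        result[oid] = numbers[root]
    return result


orders = [101, 102, 103, 104, 105]
pairs = [(101, 103), (103, 105)]  # 101 and 105 are linked only through 103
customer = assign_customer_numbers(orders, pairs)
```

Note that chaining can merge records that were never compared directly, so a loose matching rule can snowball into oversized groups.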

 

I know this cannot be done perfectly and that some degree of error is unavoidable, but pointing me towards any functions or papers that have been written on this topic would be great. I have only recently started looking at the SOUNDEX and SPEDIS functions and am thinking about how I could use them to achieve my goal.
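To make the idea concrete, here is a rough sketch in Python (not SAS) of combining a phonetic key with a spelling distance. It implements the classic four-character Soundex and a plain Levenshtein edit distance; these are stand-ins and are not guaranteed to return the same values as the SAS SOUNDEX and SPEDIS functions. The function names, the thresholds, and the sample records are all assumptions for illustration.

```python
# Letter-to-digit table for classic Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Classic Soundex: first letter plus three digits, zero padded."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = code
    return "".join(out)[:4].ljust(4, "0")


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def likely_same(rec_a, rec_b, max_name_edits=2, max_dob_edits=1):
    """Hypothetical rule: names sound alike or are spelled closely,
    and the birthdates differ by at most one character."""
    (name_a, dob_a), (name_b, dob_b) = rec_a, rec_b
    same_sound = soundex(name_a) == soundex(name_b)
    close_name = edit_distance(name_a.lower(), name_b.lower()) <= max_name_edits
    close_dob = edit_distance(dob_a, dob_b) <= max_dob_edits
    return (same_sound or close_name) and close_dob


a = ("Jon Smith", "1980-03-12")
b = ("John Smith", "1980-03-12")
c = ("Mary Jones", "1975-11-02")
print(likely_same(a, b), likely_same(a, c))
```

With this rule a record with a missing birthdate never matches anything, so blank fields would need separate handling.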

 

Thank you.

Respected Advisor
Posts: 3,889

Re: Matching and Collapsing Line Listed Data

If you have the SAS Data Quality Server licensed and properly configured then you can use DataFlux functions for creating standardized names and matchcodes.

Some DataFlux functionality can be called directly out of "normal" SAS via functions:

http://support.sas.com/documentation/cdl/en/dqclref/68376/HTML/default/viewer.htm#p0k705exnmtpgin1xp...

 

Occasional Contributor
Posts: 14

Re: Matching and Collapsing Line Listed Data


Patrick wrote:

If you have the SAS Data Quality Server licensed and properly configured then you can use DataFlux functions for creating standardized names and matchcodes.

Some DataFlux functionality can be called directly out of "normal" SAS via functions:

http://support.sas.com/documentation/cdl/en/dqclref/68376/HTML/default/viewer.htm#p0k705exnmtpgin1xp...

 



Unfortunately I don't have access to SAS Data Quality Server. Thanks for suggesting the function though, it seems like it could be useful.

Super User
Posts: 17,819

Re: Matching and Collapsing Line Listed Data

Reference to an open source solution:

http://www5.statcan.gc.ca/olc-cel/olc.action?lang=en&ObjId=10H0036&ObjType=22

 

I like the solution for names from @FriedEgg here: https://communities.sas.com/t5/SAS-Procedures/Name-matching/m-p/82780/highlight/true#M23757

 

but it's only for names. 

Trusted Advisor
Posts: 1,300

Re: Matching and Collapsing Line Listed Data

There are many options available. If you are dealing with more information than just a name (e.g. name and DOB, address, etc.), I would recommend that you use probabilistic record linking. You can use http://the-link-king.com/, which is written in SAS and is free. It also offers a lot of other methods and requires virtually no programming on your part.
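
For readers new to the term, probabilistic record linking scores each pair of records field by field and sums the evidence. The sketch below, in Python rather than SAS, shows the general Fellegi-Sunter style of weighting. It is not how The Link King is implemented; the field names, the m and u probabilities, and the thresholds are made-up values for illustration only.

```python
from math import log2

# Hypothetical probabilities per field:
#   m = P(field agrees | records are the same person)
#   u = P(field agrees | records are different people)
FIELDS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "dob": (0.97, 0.001),
}


def match_weight(rec_a, rec_b):
    """Sum log-likelihood-ratio weights over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if not a or not b:
            continue  # a missing value adds no evidence either way
        if a == b:
            total += log2(m / u)              # agreement raises the score
        else:
            total += log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def classify(weight, lower=0.0, upper=10.0):
    """Assumed cut-offs: above upper is a match, below lower is not."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "review"


r1 = {"last_name": "smith", "first_name": "john", "dob": "1980-03-12"}
r2 = {"last_name": "smith", "first_name": "jon", "dob": "1980-03-12"}
r3 = {"last_name": "jones", "first_name": "mary", "dob": "1975-11-02"}
print(classify(match_weight(r1, r2)), classify(match_weight(r1, r3)))
```

Exact equality is used for agreement here to keep the sketch short; in practice the comparison for names would usually be a fuzzy one.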
Occasional Contributor
Posts: 14

Re: Matching and Collapsing Line Listed Data


FriedEgg wrote:
There are many options available. If you are dealing with more information than just a name (e.g. name and DOB, address, etc.), I would recommend that you use probabilistic record linking. You can use http://the-link-king.com/, which is written in SAS and is free. It also offers a lot of other methods and requires virtually no programming on your part.

 

Cool, thanks, this looks promising. I'm going to check this out.

Occasional Contributor
Posts: 14

Re: Matching and Collapsing Line Listed Data


Reeza wrote:

Reference to an open source solution:

http://www5.statcan.gc.ca/olc-cel/olc.action?lang=en&ObjId=10H0036&ObjType=22

 

I like the solution for names from @FriedEgg here: https://communities.sas.com/t5/SAS-Procedures/Name-matching/m-p/82780/highlight/true#M23757

 

but it's only for names. 


Is this similar to Link Plus, because I think I have access to that.

Super User
Posts: 17,819

Re: Matching and Collapsing Line Listed Data

I don't know what Link Plus is...

Occasional Contributor
Posts: 14

Re: Matching and Collapsing Line Listed Data


Reeza wrote:

I don't know what Link Plus is...


 

The reason I asked is that it's difficult to get some software installed at work. It seems similar. Thanks for the info.

 

"Link Plus is a probabilistic record linkage program developed at CDC’s Division of Cancer Prevention and Control in support of CDC’s National Program of Cancer Registries (NPCR). Link Plus is a record linkage tool for cancer registries. It is an easy-to-use, standalone application for Microsoft® Windows® that can run in two modes—

  • To detect duplicates in a cancer registry database.
  • To link a cancer registry file with external files.

Although originally designed to be used by cancer registries, the program can be used with any type of data in fixed width or delimited format. Used extensively across a diversity of research disciplines, Link Plus is rapidly becoming an essential linkage tool for researchers and organizations that maintain public health data."

 

 

Trusted Advisor
Posts: 1,300

Re: Matching and Collapsing Line Listed Data

It is similar in that they both provide a method for performing probabilistic record linking. Link Plus is a specific application designed for cancer registry databases, while the package I linked to is a SAS/AF application written to perform a wide breadth of record linking and name matching techniques. If you are familiar with Link Plus, try using that software.

Occasional Contributor
Posts: 14

Re: Matching and Collapsing Line Listed Data


FriedEgg wrote:

It is similar in that they both provide a method for performing probabilistic record linking. Link Plus is a specific application designed for cancer registry databases, while the package I linked to is a SAS/AF application written to perform a wide breadth of record linking and name matching techniques. If you are familiar with Link Plus, try using that software.


I am not familiar with Link Plus; I just have access to it. I'd prefer to use SAS directly, so once I figure out which program fits my needs, that will probably be the best option. Thanks again.
