honk
Obsidian | Level 7

Hi All,

 

I am trying to take line-listed data with people's names and birthdays (which may or may not be entered correctly) and create a new value so that I can match records and identify those that belong to one individual.

 

I have attached this SAS data set as an example. It contains made-up names and birthdates: some match exactly, some are misspelled, and some belong to different people. I have framed the scenario as orders for a company, so there is an order number, and the customer number is blank. I want to populate the customernumber variable so that I can identify orders that came from the same individual, even though a clerk may have typed that person's information incorrectly, or not at all, on a given order.
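To make the goal concrete, here is a minimal sketch in Python rather than SAS. The sample orders, the `same_person` rule, and the 0.85 threshold are all assumptions for illustration and are not taken from the attached data set. Each order is compared with every other order, and orders judged to be the same person are merged into one group (a small union-find), which then receives one customer number.

```python
from difflib import SequenceMatcher

# Hypothetical sample orders: (order_number, name, birthdate as typed by the clerk)
orders = [
    (1001, "John Smith", "1980-03-14"),
    (1002, "Jon Smith", "1980-03-14"),
    (1003, "Mary Jones", "1975-11-02"),
    (1004, "John Smith", "1980-03-14"),
    (1005, "Mary Jonse", "1975-11-02"),
    (1006, "Peter Brown", "1990-07-21"),
]


def same_person(a, b, threshold=0.85):
    """Assumed rule: equal birthdates and names at least `threshold` similar."""
    name_sim = SequenceMatcher(None, a[1].lower(), b[1].lower()).ratio()
    return a[2] == b[2] and name_sim >= threshold


def assign_customer_numbers(rows):
    """Return {order_number: customer_number} by grouping matching rows."""
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison: fine for small files, too slow for large ones.
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if same_person(rows[i], rows[j]):
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    # Number the groups in order of first appearance.
    numbers = {}
    result = {}
    for i, row in enumerate(rows):
        root = find(i)
        if root not in numbers:
            numbers[root] = len(numbers) + 1
        result[row[0]] = numbers[root]
    return result


customer_numbers = assign_customer_numbers(orders)
```

The pairwise loop grows quadratically with the number of orders, so a real file would probably need some form of blocking (for example, only comparing orders that share a birth year) before comparing names.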

 

I know this cannot be done perfectly and that some degree of error is unavoidable, but pointing me toward any functions or papers on this topic would be great. I have just started looking at the SOUNDEX and SPEDIS functions and am thinking about how I could use them to achieve my goal.
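As a rough illustration of the two ideas behind those functions, here is a sketch in Python, not SAS. The `soundex` function below is a simplified American Soundex and is not guaranteed to return the same codes as the SAS SOUNDEX function. The `levenshtein` function is a plain edit distance used as a stand-in for SPEDIS, which uses its own cost scheme and will give different numbers.

```python
def soundex(name):
    """Simplified American Soundex: first letter plus up to three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    out = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        digit = codes.get(ch, "")  # vowels and Y map to "" and reset prev
        if digit and digit != prev:
            out += digit
        prev = digit
    return (out + "000")[:4]


def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


# Names that sound alike share a code; small typos give a small distance.
print(soundex("Smith"), soundex("Smyth"))
print(levenshtein("Jones", "Jonse"))
```

One possible use is to treat two records as candidates for the same person when the phonetic codes of the last names agree, and then use the edit distance on the full name and the birthdate to decide how close they really are.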

 

Thank you.

10 REPLIES
Patrick
Opal | Level 21

If you have the SAS Data Quality Server licensed and properly configured then you can use DataFlux functions for creating standardized names and matchcodes.

Some DataFlux functionality can be called directly from "normal" SAS via functions:

http://support.sas.com/documentation/cdl/en/dqclref/68376/HTML/default/viewer.htm#p0k705exnmtpgin1xp...

 

honk
Obsidian | Level 7

@Patrick wrote:

If you have the SAS Data Quality Server licensed and properly configured then you can use DataFlux functions for creating standardized names and matchcodes.

Some DataFlux functionality can be called directly from "normal" SAS via functions:

http://support.sas.com/documentation/cdl/en/dqclref/68376/HTML/default/viewer.htm#p0k705exnmtpgin1xp...

 



Unfortunately, I don't have access to SAS Data Quality Server. Thanks for suggesting the function, though; it seems like it could be useful.

FriedEgg
SAS Employee
There are many options available. If you are dealing with more information than just name (e.g., name and DOB, address, etc.), I would recommend that you use probabilistic record linking. You can use http://the-link-king.com/, which is written in SAS and is free. It also offers a lot of other methods and requires virtually no programming on your part.
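For readers unfamiliar with probabilistic record linking, here is a minimal sketch of the general idea in Python, in the style of Fellegi-Sunter agreement weights. It is not how The Link King works internally. The field names and the m and u probabilities below are made-up assumptions; in practice they would be estimated from the data.

```python
import math

# Assumed probabilities per field (illustrative values only):
#   m = P(field agrees | records are the same person)
#   u = P(field agrees | records are different people)
FIELDS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "dob": (0.97, 0.003),
}


def match_weight(rec_a, rec_b):
    """Sum of log2 likelihood ratios over all fields.

    Agreement on a field adds log2(m / u), a positive weight.
    Disagreement adds log2((1 - m) / (1 - u)), a negative weight.
    """
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


# Hypothetical records
a = {"first_name": "john", "last_name": "smith", "dob": "1980-03-14"}
b = {"first_name": "jon", "last_name": "smith", "dob": "1980-03-14"}
c = {"first_name": "mary", "last_name": "jones", "dob": "1975-11-02"}

print(round(match_weight(a, b), 2))  # one field disagrees, weight still positive
print(round(match_weight(a, c), 2))  # all fields disagree, weight negative
```

Pairs with a total weight above an upper threshold would be treated as links, pairs below a lower threshold as non-links, and those in between sent for manual review. Exact equality per field is used here only to keep the sketch short; an approximate comparison could replace it.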
honk
Obsidian | Level 7

@FriedEgg wrote:
There are many options available. If you are dealing with more information than just name (e.g., name and DOB, address, etc.), I would recommend that you use probabilistic record linking. You can use http://the-link-king.com/, which is written in SAS and is free. It also offers a lot of other methods and requires virtually no programming on your part.

 

Cool, thanks. This looks promising. I'm going to check it out.

honk
Obsidian | Level 7

@Reeza wrote:

Reference to an open source solution:

http://www5.statcan.gc.ca/olc-cel/olc.action?lang=en&ObjId=10H0036&ObjType=22

 

I like the solution for names from @FriedEgg here: https://communities.sas.com/t5/SAS-Procedures/Name-matching/m-p/82780/highlight/true#M23757

 

but it's only for names. 


Is this similar to Link Plus? I ask because I think I have access to that.

Reeza
Super User

I don't know what Link Plus is...

honk
Obsidian | Level 7

@Reeza wrote:

I don't know what Link Plus is...


 

The reason I asked is that it's difficult to get some software installed at work. Link Plus seems similar. Thanks for the info.

 

"Link Plus is a probabilistic record linkage program developed at CDC’s Division of Cancer Prevention and Control in support of CDC’s National Program of Cancer Registries (NPCR). Link Plus is a record linkage tool for cancer registries. It is an easy-to-use, standalone application for Microsoft® Windows® that can run in two modes—

  • To detect duplicates in a cancer registry database.
  • To link a cancer registry file with external files.

Although originally designed to be used by cancer registries, the program can be used with any type of data in fixed width or delimited format. Used extensively across a diversity of research disciplines, Link Plus is rapidly becoming an essential linkage tool for researchers and organizations that maintain public health data."

 

 

FriedEgg
SAS Employee

It is similar in that both provide a method for performing probabilistic record linking. Link Plus is a specific application designed for cancer registry databases, while the package I linked to is a SAS/AF application written to perform a wide breadth of record linking and name matching techniques. If you are familiar with Link Plus, try using that software.

honk
Obsidian | Level 7

@FriedEgg wrote:

It is similar in that both provide a method for performing probabilistic record linking. Link Plus is a specific application designed for cancer registry databases, while the package I linked to is a SAS/AF application written to perform a wide breadth of record linking and name matching techniques. If you are familiar with Link Plus, try using that software.


I am not familiar with Link Plus; I just have access to it. I'd prefer to use SAS directly, so once I figure out which program fits my needs, that will probably be the best option. Thanks again.
