Matching two datasets on two character variables

TMorville · Posted 08-01-2012 06:01 AM

This is something that's been driving me crazy all afternoon, so now i'm here hoping for some input.

I have two datasets:

One is approx. 100 K obs. they have unique ID numbers, and will be the ones i'm matching up against.

My other dataset is approx. 14 K, and i need to find out, how many of these are in the big dataset. BUT...

The obs. in the smaller dataset, do not have unique ID, only names of people and their addresses.

So basically, i need to match the two datasets, on their names (as a first priority) and it there are duplicates, on their addresses.

Their entire name is in one variable, and that counts for both datasets. Adresses in the bigger dataset is split by roadname, and number in two

variables, but i've used a catx to merge them, so they're identical to the address variable in the smallerdataset.

I've chosen not to attatch any example data, as its quite simply a name variable (consisting of two to several names) and an andress variable, consisting of

one to several roadnames, and a housenumber.

I've tryed googleing different proc merge techniques, index function and macros, but i can't seem to anything to work.

Hope you can help!

Kind regards

Tobias

Patrick · Posted 08-01-2012 07:40 AM

If there is a match over name but not over address with no duplicate names. Is this a match or not? - If it's not a match then you could simply join over name and address from the very beginning.

Is your data stored as SAS data sets or are these database tables? - Thinking about loading your small data set into a hash table.

TMorville · Posted 08-01-2012 07:48 AM

Hi Patrick, and thanks for the input.

If there is match over names, and no duplicates - it's s full match. But i would like to match over address too if possible.

There are both partial and complete matches on both names and addresses.

On some accounts, i can get a clear match on the entire name, and address - but then in the next row, the name in the small dataset mig have Mike Marigold Johansen, while the bigger dataset has Mike M. Johansen. The same idea goes for the addess. The same person might appear living on Homesteadroad, and in the other dataset Homestead Road.

Im not completely sure what you mean by simply joining over name and address?

Both datasets are stored as SAS. The small one is importet frrom excel, the other is generated in SAS.

Patrick · Posted 08-01-2012 08:30 AM

"....Mike Marigold Johansen, while the bigger dataset has Mike M. Johansen..." Ooops! And I'm sure this is only one way of differences. There are also titles, last name and first name order not fix and so on. Typical messy address data.

You need first to standardize your data in one way or the other. If you've got only Base SAS then it's probabely with code like in the link Art posted.

The ideal tool to use for this task would be DataFlux which comes as part of SAS' Dataquality Server or also as stand alone product. But of course: It costs extra and is a beast on its own one must first learn to handle.

If this is not a once off task for you/your organisation or if it's really important to get as many correct matches as possible then it might be worth contacting your local SAS office about it. The big difference between the DataFlux approach to any coding approach: DataFlux uses a combination of database based and rule based algorithms to clean up data, and standardising address data is a standard problem. I've been in a 5 days fast track training this year and I was quite impressed how easy it was to streamline very messy address data with DataFlux.

TMorville · Posted 08-01-2012 09:03 AM

Hi Art, and thanks for the link.

The SAS code that's provided there is alot over my current datastepping abstraction level, but it seems like i'll have to dig in.

Unfortunately, getting DataFlux is not an option. I'm a student, and using the license that my university provided, which does not include that module.

I've attached two datasets to my original post now. They look like my standard data, but i've invented/manipulated the names a adresses. Its the smaller dataset that i need to match to the bigger one, to see how many of them are in the larger one. I've taken a few names and addresses from the smaller dataset, and copied them to the larger one, so i can see if the matching works.

I would really appriciate, if someone out there would take a look at it, and mabye guide me in the right direction.

art297 · Posted 08-01-2012 08:11 AM

Tobias,

You didn't Google the right terms. Take a look at: http://www.sconsig.com/sastips/tip00000.htm

The source code is provided at the bottom of the page.

HTH,

Art

Matching two datasets on two character variables

Re: Matching two datasets on two character variables

Re: Matching two datasets on two character variables

Re: Matching two datasets on two character variables

Re: Matching two datasets on two character variables

Re: Matching two datasets on two character variables

Matching two datasets on two character variables

Re: Matching two datasets on two character variables

Re: Matching two datasets on two character variables

Re: Matching two datasets on two character variables

Re: Matching two datasets on two character variables

Re: Matching two datasets on two character variables

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!