DATA Step, Macro, Functions and more

Merging/Joining Datasets

Reply
New Contributor
Posts: 3

Merging/Joining Datasets

I am trying to join two datasets, but the common variable is not exactly identical. Would I have to adjust the names of one of the datasets in order to successfully merge?

 

This is potentially a problem because I am trying to match the lat_lon_sort dataset with several different datasets that look similar to the O3 dataset, but with slightly different variable names.

 

lat_lon_sort:

Abbotsford_A_Columbia_Street

Abbotsford_Central

Agassiz_municipal_Hall

Burnaby_Burmount

Burnaby_Kensington_Park

Burnaby_Mountain

 

O3:

Abfd_A_Col_St_O3

Abfd_A_O3

Abfd_Central_O3

Aggsz_Mun_Hall_O3

Burnaby_Kens_Park_O3

Burnaby_Mountain_O3

Super User
Posts: 17,758

Re: Merging/Joining Datasets


takumamih wrote:

I am trying to join two datasets, but the common variable is not exactly identical. Would I have to adjust the names of one of the datasets in order to successfully merge?

 


Yes. This is known as fuzzy matching and is a very difficult issue to solve because it's hard to identify the matches. You first need to consider how accurate the reuslts need to be before you can determine what approach to take. 


If close enough is ok, you can look at using the COMPGED/COMPLEV or similar functions to match. They provide a 'distance' measure for character variables so you can see how similar they are when comparing. You also need to consider timing, because its not an exact match you need to now compare each value to every other value which can really really be problematic for big datasets.

 

 

Respected Advisor
Posts: 3,887

Re: Merging/Joining Datasets

@takumamih

On top of what @Reeza wrote: If you've got the Data Quality Server licensed then you can also create match codes using dqmatch() and then join over these match codes.

http://support.sas.com/documentation/cdl/en/dqclref/70016/HTML/default/viewer.htm#p09nffezbjyj4on11o...

 

Based on the data you've posted you would still need to implement some data prep logic before feeding into dqmatch() as the strings look too different to each other as that I'd expect dqmatch() to return the same match codes for values you'd like to get matched.

 

What using dqmatch() would solve for you:

1. You're creating first these match codes on row level (that's the fuzzy part) but then join via an exact match over these match codes and though don't need anymore compare everything with everything.

2. You take advantage of a knowledge base (QKB) which comes already with a lot of pre-defined rules so you don't need to start from scratch.

Ask a Question
Discussion stats
  • 2 replies
  • 192 views
  • 4 likes
  • 3 in conversation