BookmarkSubscribeRSS Feed
takumamih
Calcite | Level 5

I am trying to join two datasets, but the common variable is not exactly identical. Would I have to adjust the names of one of the datasets in order to successfully merge?

 

This is potentially a problem because I am trying to match the lat_lon_sort dataset with several different datasets that look similar to the O3 dataset, but with slightly different variable names.

 

lat_lon_sort:

Abbotsford_A_Columbia_Street

Abbotsford_Central

Agassiz_municipal_Hall

Burnaby_Burmount

Burnaby_Kensington_Park

Burnaby_Mountain

 

O3:

Abfd_A_Col_St_O3

Abfd_A_O3

Abfd_Central_O3

Aggsz_Mun_Hall_O3

Burnaby_Kens_Park_O3

Burnaby_Mountain_O3

2 REPLIES 2
Reeza
Super User

@takumamih wrote:

I am trying to join two datasets, but the common variable is not exactly identical. Would I have to adjust the names of one of the datasets in order to successfully merge?

 


Yes. This is known as fuzzy matching and is a very difficult issue to solve because it's hard to identify the matches. You first need to consider how accurate the reuslts need to be before you can determine what approach to take. 


If close enough is ok, you can look at using the COMPGED/COMPLEV or similar functions to match. They provide a 'distance' measure for character variables so you can see how similar they are when comparing. You also need to consider timing, because its not an exact match you need to now compare each value to every other value which can really really be problematic for big datasets.

 

 

Patrick
Opal | Level 21

@takumamih

On top of what @Reeza wrote: If you've got the Data Quality Server licensed then you can also create match codes using dqmatch() and then join over these match codes.

http://support.sas.com/documentation/cdl/en/dqclref/70016/HTML/default/viewer.htm#p09nffezbjyj4on11o...

 

Based on the data you've posted you would still need to implement some data prep logic before feeding into dqmatch() as the strings look too different to each other as that I'd expect dqmatch() to return the same match codes for values you'd like to get matched.

 

What using dqmatch() would solve for you:

1. You're creating first these match codes on row level (that's the fuzzy part) but then join via an exact match over these match codes and though don't need anymore compare everything with everything.

2. You take advantage of a knowledge base (QKB) which comes already with a lot of pre-defined rules so you don't need to start from scratch.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 1001 views
  • 4 likes
  • 3 in conversation