I am trying to join two datasets, but the common variable is not exactly identical. Would I have to adjust the names of one of the datasets in order to successfully merge?
This is potentially a problem because I am trying to match the lat_lon_sort dataset with several different datasets that look similar to the O3 dataset, but with slightly different variable names.
lat_lon_sort:
Abbotsford_A_Columbia_Street
Abbotsford_Central
Agassiz_municipal_Hall
Burnaby_Burmount
Burnaby_Kensington_Park
Burnaby_Mountain
O3:
Abfd_A_Col_St_O3
Abfd_A_O3
Abfd_Central_O3
Aggsz_Mun_Hall_O3
Burnaby_Kens_Park_O3
Burnaby_Mountain_O3
@takumamih wrote:
I am trying to join two datasets, but the common variable is not exactly identical. Would I have to adjust the names of one of the datasets in order to successfully merge?
Yes. This is known as fuzzy matching and is a very difficult issue to solve because it's hard to identify the matches. You first need to consider how accurate the reuslts need to be before you can determine what approach to take.
If close enough is ok, you can look at using the COMPGED/COMPLEV or similar functions to match. They provide a 'distance' measure for character variables so you can see how similar they are when comparing. You also need to consider timing, because its not an exact match you need to now compare each value to every other value which can really really be problematic for big datasets.
On top of what @Reeza wrote: If you've got the Data Quality Server licensed then you can also create match codes using dqmatch() and then join over these match codes.
Based on the data you've posted you would still need to implement some data prep logic before feeding into dqmatch() as the strings look too different to each other as that I'd expect dqmatch() to return the same match codes for values you'd like to get matched.
What using dqmatch() would solve for you:
1. You're creating first these match codes on row level (that's the fuzzy part) but then join via an exact match over these match codes and though don't need anymore compare everything with everything.
2. You take advantage of a knowledge base (QKB) which comes already with a lot of pre-defined rules so you don't need to start from scratch.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.