Hello. I'm not sure if this is even possible but thought I'd check. I have 2 datasets. They both have multiple variables, but the one that is similar between the two are address variables, and those are the variables I need to try and match.
This issue is, one set has values such as:
193 FAIRVIEW LN
1006 NUT TREE ROAD
6121 PASEO DEL NORTE
9401 PAINTER AVE
3744 LONG BEACH BLVD
11801 PIERCE ST FL
While the other has values similar to this:
193 FAIRVIEW LN STE 100
1006 NUT TREE ROAD APT 5
6121 PASEO DEL NORTE BLDNG 2
9401 PAINTER AVE SUITE 2A
3744 LONG BEACH BLVD SUITE 224
11801 PIERCE ST FL APT 315
I want to create a format from one dataset and use that format on the other data set to create a new variable (say addr_match), where if theres a match, then addr_match = 1.
Is there a way to match these variables where if part of the addresses match then it'd be addr_match = 1, or do the values in the address fields need to be exactly the same? There are thousands of rows in each dataset.
To create a dataset of the matched results just add a CREATE TABLE to the query:
proc sql;
create table Want as
select h1.addr_short, h2.addr_long
from have1 h1 inner join have2 h2
on h2.addr_long like cats(h1.addr_short,'%')
;
quit;
Matching addresses is complicated. In fact SAS has aproduct designed to do this - SAS Data Quality. The question I have is how do you know the examples you provide are indeed the same addresses? If you are validating these addresses for postal purposes, then an apartment number at a street address is not the same as the actual street address itself. If all you are required to do is match the street address, that might be good enough.
Provide some example data in the form of data steps and perhaps you'll get some example code.
If you values are that "clean", not having to match "LN" to "LANE" for example it may be quite doable.
Format per se may not be the best approach though. Or you would have to code a function and then create a format using that function and may be more work that some sort of Join with a scoring function such as Verify or COMGED
Your sample data indicates that the shorter string is exactly contained in the longer string. If that's representative for your real data then code along the line of below could work.
data have1;
infile datalines truncover;
input addr_short $100.;
datalines;
193 FAIRVIEW LN
1006 NUT TREE ROAD
6121 PASEO DEL NORTE
9401 PAINTER AVE
3744 LONG BEACH BLVD
11801 PIERCE ST FL
;
data have2;
infile datalines truncover;
input addr_long $150.;
datalines;
193 FAIRVIEW LN STE 100
1006 NUT TREE ROAD APT 5
6121 PASEO DEL NORTE BLDNG 2
9401 PAINTER AVE SUITE 2A
3744 LONG BEACH BLVD SUITE 224
11801 PIERCE ST FL APT 315
;
/* using SAS find() function */
proc sql;
select h1.addr_short, h2.addr_long
from have1 h1 inner join have2 h2
on find(h2.addr_long,strip(h1.addr_short))>0
;
quit;
/* using SQL like operator */
proc sql;
select h1.addr_short, h2.addr_long
from have1 h1 inner join have2 h2
on h2.addr_long like cats(h1.addr_short,'%')
;
quit;
Thank you for this. I've ran this and I do get results showing matches. I want to output all matching values into a new dataset (move values from the h1 dataset that match the h2 values). Using the code you provided, is there a step I can add to output the results? Or can I do this in a different way that can output a new dataset that contains the matches?
To create a dataset of the matched results just add a CREATE TABLE to the query:
proc sql;
create table Want as
select h1.addr_short, h2.addr_long
from have1 h1 inner join have2 h2
on h2.addr_long like cats(h1.addr_short,'%')
;
quit;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.