Solved: Re: Match two datasets on a partial match

JH74 · Posted 09-08-2023 06:39 PM

Hello. I'm not sure if this is even possible but thought I'd check. I have 2 datasets. They both have multiple variables, but the one that is similar between the two are address variables, and those are the variables I need to try and match.

This issue is, one set has values such as:

193 FAIRVIEW LN
1006 NUT TREE ROAD
6121 PASEO DEL NORTE
9401 PAINTER AVE
3744 LONG BEACH BLVD
11801 PIERCE ST FL

While the other has values similar to this:

193 FAIRVIEW LN STE 100
1006 NUT TREE ROAD APT 5
6121 PASEO DEL NORTE BLDNG 2
9401 PAINTER AVE SUITE 2A
3744 LONG BEACH BLVD SUITE 224
11801 PIERCE ST FL APT 315

I want to create a format from one dataset and use that format on the other data set to create a new variable (say addr_match), where if theres a match, then addr_match = 1.

Is there a way to match these variables where if part of the addresses match then it'd be addr_match = 1, or do the values in the address fields need to be exactly the same? There are thousands of rows in each dataset.

SASKiwi · Posted 09-11-2023 04:25 PM

To create a dataset of the matched results just add a CREATE TABLE to the query:

proc sql;
  create table Want as
  select h1.addr_short, h2.addr_long
  from have1 h1 inner join have2 h2
  on h2.addr_long like cats(h1.addr_short,'%')
  ;
quit;

View solution in original post

SASKiwi · Posted 09-08-2023 06:54 PM

Matching addresses is complicated. In fact SAS has aproduct designed to do this - SAS Data Quality. The question I have is how do you know the examples you provide are indeed the same addresses? If you are validating these addresses for postal purposes, then an apartment number at a street address is not the same as the actual street address itself. If all you are required to do is match the street address, that might be good enough.

ballardw · Posted 09-08-2023 07:17 PM

Provide some example data in the form of data steps and perhaps you'll get some example code.

If you values are that "clean", not having to match "LN" to "LANE" for example it may be quite doable.

Format per se may not be the best approach though. Or you would have to code a function and then create a format using that function and may be more work that some sort of Join with a scoring function such as Verify or COMGED

Patrick · Posted 09-08-2023 09:25 PM

Your sample data indicates that the shorter string is exactly contained in the longer string. If that's representative for your real data then code along the line of below could work.

data have1;
  infile datalines truncover;
  input addr_short $100.;
  datalines;
193 FAIRVIEW LN
1006 NUT TREE ROAD
6121 PASEO DEL NORTE
9401 PAINTER AVE
3744 LONG BEACH BLVD
11801 PIERCE ST FL 
;
data have2;
  infile datalines truncover;
  input addr_long $150.;
  datalines;
193 FAIRVIEW LN STE 100
1006 NUT TREE ROAD APT 5
6121 PASEO DEL NORTE BLDNG 2
9401 PAINTER AVE SUITE 2A
3744 LONG BEACH BLVD SUITE 224
11801 PIERCE ST FL APT 315
;

/* using SAS find() function */
proc sql;
  select h1.addr_short, h2.addr_long
  from have1 h1 inner join have2 h2
  on find(h2.addr_long,strip(h1.addr_short))>0
  ;
quit;

/* using SQL like operator */
proc sql;
  select h1.addr_short, h2.addr_long
  from have1 h1 inner join have2 h2
  on h2.addr_long like cats(h1.addr_short,'%')
  ;
quit;

JH74 · Posted 09-11-2023 03:50 PM

Thank you for this. I've ran this and I do get results showing matches. I want to output all matching values into a new dataset (move values from the h1 dataset that match the h2 values). Using the code you provided, is there a step I can add to output the results? Or can I do this in a different way that can output a new dataset that contains the matches?

SASKiwi · Posted 09-11-2023 04:25 PM

To create a dataset of the matched results just add a CREATE TABLE to the query:

proc sql;
  create table Want as
  select h1.addr_short, h2.addr_long
  from have1 h1 inner join have2 h2
  on h2.addr_long like cats(h1.addr_short,'%')
  ;
quit;

HB · Posted 09-11-2023 05:17 PM

Shorter string in a longer string
---h2.addr_long like cats(h1.addr_short,'%')
Worth the price of admission right there.

Registration is open

SAS Training: Just a Click Away