BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Angel_Saenz
Quartz | Level 8

How to make a match between two variables similar to the merge but that the condition is that they are not exactly the same, but just similar?

 

For example I have 2 datasets:

 

dataset1-var_street

Manhattan street 1000

 

dataset2-var_street

Madison 1000

Manhattan st 1000

Manhattan street 1000

Manhatan street 1000

Manhattan st. 1000

Moore 1000

Manhattan 1100

Manhattan street 100

Manhattan st. 2000

 

If I use MERGE function the result is only one obs:

dataset3-var_street

Manhattan street 1000

 

But I want some like this:

dataset3-var_street

Manhattan st 1000

Manhattan street 1000

Manhatan street 1000

Manhattan st. 1000

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

Your trying to do a fuzzy match. You can find a number of possibilities searching for that term.

 

Here is one method. You are looking for low scores, but have to decide how low is still a good match. In your example, 300 or lower would work:

data dataset1;
  informat var_street $100.;
  input var_street & number;
  cards;
Manhattan street  1000
;
 
data dataset2;
  informat var_street $100.;
  input var_street & number;
  cards;
Madison  1000
Manhattan st  1000
Manhattan street  1000
Manhatan street  1000
Manhattan st.  1000
Moore  1000
Manhattan  1100
Manhattan street  100
Manhattan st.  2000
;

/* get number of records in bankinfo dataset */
data _null_;
 if 0 then set dataset2 nobs=nobs;
 call symput('numrec',nobs);
 stop;
run;

data test (keep=var_street number compare_add compare_num score);
  array addresses(&numrec) $100;
  array numbers(&numrec);
  do i=1 to &numrec;
     set dataset2(rename=(var_street=add_check number=num_check));
    addresses(i)=add_check;
    numbers(i)=num_check;
  end;
  do until (eof); 
    set dataset1 end=eof;
    do i=1 to &numrec;
      if number eq numbers(i) then do;
        compare_add=addresses(i);
        compare_num=numbers(i);
        score= compged(var_street,addresses(i));
        output;
      end;
    end;
  end;
run;

Art, CEO, AnalystFinder.com

 

View solution in original post

5 REPLIES 5
art297
Opal | Level 21

Your trying to do a fuzzy match. You can find a number of possibilities searching for that term.

 

Here is one method. You are looking for low scores, but have to decide how low is still a good match. In your example, 300 or lower would work:

data dataset1;
  informat var_street $100.;
  input var_street & number;
  cards;
Manhattan street  1000
;
 
data dataset2;
  informat var_street $100.;
  input var_street & number;
  cards;
Madison  1000
Manhattan st  1000
Manhattan street  1000
Manhatan street  1000
Manhattan st.  1000
Moore  1000
Manhattan  1100
Manhattan street  100
Manhattan st.  2000
;

/* get number of records in bankinfo dataset */
data _null_;
 if 0 then set dataset2 nobs=nobs;
 call symput('numrec',nobs);
 stop;
run;

data test (keep=var_street number compare_add compare_num score);
  array addresses(&numrec) $100;
  array numbers(&numrec);
  do i=1 to &numrec;
     set dataset2(rename=(var_street=add_check number=num_check));
    addresses(i)=add_check;
    numbers(i)=num_check;
  end;
  do until (eof); 
    set dataset1 end=eof;
    do i=1 to &numrec;
      if number eq numbers(i) then do;
        compare_add=addresses(i);
        compare_num=numbers(i);
        score= compged(var_street,addresses(i));
        output;
      end;
    end;
  end;
run;

Art, CEO, AnalystFinder.com

 

Angel_Saenz
Quartz | Level 8
GREAT! thank you Art!
Angel_Saenz
Quartz | Level 8

Art, I have one question, what happend if I have more than 1 obs in dataset 1:

For example:

 

data dataset1;
  informat var_street $100.;
  input var_street & number;
  cards;
Manhattan street  1000

Moore                   1001
;

 

And I want to compare two addresses Manhattan street  1000 and Moore   1001 ?

art297
Opal | Level 21

@Angel_Saenz: The code, as is, will accomodate any number of records in either of the two datasets. Until you discover what value provides false positives, you could always sort the resulting file by 

var_street number score

and then, possibly, only keep the record (for each var_street number) with the lowest score.

 

Art, CEO, AnalystFinder.com

 

Patrick
Opal | Level 21

@Angel_Saenz

If you've got the SAS Data Quality Server (comes also with offerings like the SAS Data Management bundle I believe) then you could first standardize your addresses and then use these standardized addresses for the merge.

With the DQ server ("DataFlux") you could also create match codes and then merge over these match codes.

 

With match codes you'll end up with a few wrong matches, which standardized addresses you'll miss a few matches.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 856 views
  • 3 likes
  • 3 in conversation