SAS® practitioners are frequently called upon to do a comparison of data between two different data sets and find that the values in synonymous fields do not line up exactly. A second quandary occurs when there is one data source to search for particular values, but those values are contained in character fields in which the values can be represented in myriad different ways. This presentation discusses robust, if not warm and fuzzy, techniques for comparing data between, and selecting data in, SAS data sets in not so ideal conditions. SAS has provided a number of tools which can perform fuzzy matching. Among these tools are wild card searches in a where statement using the LIKE (alias ?) and CONTAINS operators; string searches in an if statement using operators and functions; fuzzy comparison functions such as COMPGED and SPEDIS; PROC FCMP; PROC GEOCODE, and data driven control tables enabling user-defined formats to standardize data. The purpose of this presentation is to introduce, by example, each of these tools and techniques.
SAS has provided a number of tools which can perform “fuzzy matching”. Among these tools are “wild card” searches in a where statement using the LIKE (alias ?) and CONTAINS operators; string searches in an if statement using operators and functions; “fuzzy” comparison functions such as COMPGED and SPEDIS; PROC FCMP; PROC GEOCODE, and data driven control tables enabling user-defined formats to standardize data. The purpose of this paper is to introduce, by example, each of these tools and techniques. We urge you to look at our paper for more complete information - this article will just contain "teasers".
Below is a table containing sample data for most of our fuzzy function examples in the paper. In the sections below, we'll present a single example of selected tools and techniques. Please see our paper for more exhaustive details on everything you ever wanted to know about fuzzy functions!
FIRSTNAME |
LASTNAME |
GENDER |
SCORE |
AGE |
|
Jan |
Write |
F |
1.00000000000012 |
44.9999999999990 |
|
Lucy |
Smyth |
F |
1.00000000000121 |
53.9383983572895 |
|
Kris |
Johnson |
F |
1.00324325660746 |
39.3566050650240 |
|
Chris |
Jones |
M |
0.00000000000121 |
48.8268309377139 |
|
Tracey |
Smith |
F |
0.00000000000012 |
46.4499999999991 |
|
Tracy |
Besley |
M |
0.00324325660746 |
47.4999999999990 |
|
Tracie |
Smith-jones |
F |
-2.00000000000921 |
35.6659822039699 |
|
Chrys |
Jones-Wright |
F |
-2.00000000000092 |
34.1546885694730 |
|
Jon |
Wright |
M |
1.00000000000038 |
46.8145106091718 |
|
John |
Hall |
M |
1.00000000000384 |
42.9949999999900 |
|
Timothy |
Bones |
M |
-12.00000000000920 |
48.3504449007529 |
|
Jason |
Jaones |
M |
-12.00000000000091 |
48.7118412046543 |
|
Tyler |
ones |
M |
-5.00413456081936 |
42.9499999999990 |
There are times when you may need to subset records in a data file but specifying the full search string may be an arduous and error prone process, or there may be multiple search strings that have a similar, identifiable pattern. You could default to the standard logic of VAR in (‘FULL VALUE1’ ‘FULL VALUE2’ etc.), or, you can use the following “fuzzy” techniques.
CONTAINS or ? Operators
If you are using a sub-setting WHERE statement you can use a question mark (?) or the word CONTAINS instead of an equal (=). The CONTAINS or ? will allow look for records that have values that contain what is specified.
Using the ROSTER data set shown above for people that have ‘Jones’ in their last name, the CONTAINS or ? operators can minimize the chance of error by searching for any record that contains the word ‘Jones’ in the variable LASTNAME.
data jones_1;
set roster;
where LASTNAME ? 'Jones';
run;
data jones_2;
set roster;
where LASTNAME contains 'Jones';
run;
Regardless of which operator is used they will both yield the same results.
FIRSTNAME |
LASTNAME |
GENDER |
Chris |
Jones |
M |
Chrys |
Jones-Wright |
F |
It is important to note that the CONTAINS or ? operators are case sensitive, thus if casing should be ignored, then it should be used in conjunction with the LOWCASE function or UPCASE function to force the variable to be one case.
There are several different SAS functions-based options that can be explored when using IF statements for which wild card techniques are not applicable. These “fuzzy” search techniques are explored below.
INDEX Function
If you need to search for a string anywhere within a variable or another string, then the INDEX function could be utilized. With the INDEX function you would need to provide two arguments: the variable or string to be searched and the string that you are searching for.
Syntax: INDEX(source, excerpt)
data jones_8a;
set roster;
if index(LASTNAME, 'Jones');
run;
FIRSTNAME |
LASTNAME |
GENDER |
Chris |
Jones |
M |
Chrys |
Jones-Wright |
F |
The display above shows a subset of the ROSTER data file containing last names of Jones using the INDEX function. The downside of using INDEX is that it is case sensitive. If you need to look for values regardless of case status, you could use the UPCASE function or LOWCASE function to force the source string to be one case and then specify the excerpt string to be the same case. The program and display below show that LASTNAME = ‘Smith-jones’ was missed in the initial program execution because ‘jones’ was not proper case.
data jones_8b;
set roster;
if index(upcase(LASTNAME),'JONES');
run;
FIRSTNAME |
LASTNAME |
GENDER |
Chris |
Jones |
M |
Tracie |
Smith-jones |
F |
Chrys |
Jones-Wright |
F |
Character strings, especially strings describing names and addresses, are notoriously dirty and prone to spacing, length and punctuation issues. Any real-world comparison of character strings or selection based on character strings needs to be both flexible and configurable, i.e. the degree of “sameness” needs to be quantifiable. SAS provides several character functions that allow you to make a fuzzy comparison: COMPARE, COMPGED, COMPLEV, SOUNDEX and SPEDIS. Each of these functions use a different fuzzy algorithm and can be used in conjunction with one another to achieve a (subjectively) optimal match. Use of these functions produces inexact results by definition, and results must be reviewed carefully. We discuss the SOUNDEX function below.
SOUNDEX Function
The SOUNDEX function determines how much two character variables sound alike. It works best with the English language. It is equivalent to using =* (sounds like) on a WHERE statement.
Syntax: SOUNDEX(argument)
With the SOUNDEX function vowels and the letters ‘H’, ‘W’ and ‘Y’ are excluded except when it is the first character in the argument when determining if the argument sounds like a specific value. Other characters in the English alphabet are assigned one of the following values:
The value generated from SOUNDEX is the first character in the argument and then for each character in the argument that is not excluded is assigned one of the values above. If there are two or more consecutive characters assigned the same numeric value, then only the first one is kept.
To demonstrate the use of SOUNDEX, we execute the following data step so that we can calculate the value generated from SOUNDEX using FIRSTNAME.
data roster_soundex;
set roster;
FN_SOUND = soundex(FIRSTNAME);
run;
FIRSTNAME |
LASTNAME |
FN_SOUND |
Jan |
Write |
J5 |
Lucy |
Smyth |
L2 |
Kris |
Johnson |
K62 |
Chris |
Jones |
C62 |
Tracey |
Smith |
T62 |
Tracy |
Besley |
T62 |
Tracie |
Smith-jones |
T62 |
Chrys |
Jones-Wright |
C62 |
Jon |
Wright |
J5 |
John |
Hall |
J5 |
Timothy |
Bones |
T53 |
Jason |
Jaones |
J25 |
Tyler |
ones |
T46 |
For rows 4 and 8, we see that both yield a value of ‘C62’. This is because both started with ‘C’ and the ‘h’, ‘y’ and ‘i’ were discarded, leaving only ‘r’ and ‘s’. The ‘r’ was assigned a value of ‘6’ and ‘s’ was assigned a value of ‘2’. Notice that if the third row would have started with a ‘C’ instead of a ‘K’ it would have resulted in the same value. However, since it is started with a ‘K’, the result was ‘K62’.
For the rows 5-7, we see that the value was ‘T62’ and this was because we discarded all the vowels and ‘y’ after the first character, leaving only ‘r’ and ‘c’.
For rows 9 and 10, the ‘o’ and ‘h’ were discarded leaving only the ‘n’ after the first argument, resulting in a value of ‘J5’.
All the techniques illustrated thus far have been dealing with character strings, but what if we have a numeric value? There are various options available for determining if a numeric value is equivalent to another numeric value. In some cases, the values will be exactly equal, and no additional comparison is needed. However, there are some cases where the values are not quite equal but are equal ‘enough’ so that if the values were rounded or truncated or a ‘fuzz’ factor is added then the values would be considered equal. Depending on the type of ‘fuzz’ factor you wish to consider will determine which function should be best utilized. Using the ROSTER data in Data Display 1, we will illustrate several numeric ‘fuzzy’ functions.
CEIL and CEILZ Functions
The CEIL function rounds UP to the nearest smallest integer that is greater than or equal to the argument, that is it will return an integer value that is greater than or equal to the argument. It uses fuzzing in order to avoid issues with floating points. If the result returned from the CEIL function is with 1E-12 of the argument, then the value is considered equal to the integer portion of the argument.
Syntax: CEIL(argument)
However, if you do not want to consider any fuzzing when rounding up to the nearest integer, then CEILZ is the function that should be used. CEILZ works the same as CEIL but it does not use fuzzing. Therefore, even if the return value is within 1E-12 of the argument it will round up to the nearest smallest integer instead of considering the value equal to the integer portion of the argument.
Syntax: CEILZ(argument)
data fuzz_score;
set roster;
S_CEIL = ceil(SCORE);
S_CEILZ = ceilz(SCORE);
run;
FIRSTNAME |
LASTNAME |
GENDER |
SCORE |
S_CEIL |
S_CEILZ |
Jan |
Write |
F |
1.00000000000012 |
1 |
2 |
Lucy |
Smyth |
F |
1.00000000000121 |
2 |
2 |
Kris |
Johnson |
F |
1.00324325660746 |
2 |
2 |
Chris |
Jones |
M |
0.00000000000121 |
1 |
1 |
Tracey |
Smith |
F |
0.00000000000012 |
0 |
1 |
Tracy |
Besley |
M |
0.00324325660746 |
1 |
1 |
Tracie |
Smith-jones |
F |
-2.00000000000921 |
-2 |
-2 |
Chrys |
Jones-Wright |
F |
-2.00000000000092 |
-2 |
-2 |
Jon |
Wright |
M |
1.00000000000038 |
1 |
2 |
John |
Hall |
M |
1.00000000000384 |
2 |
2 |
Timothy |
Bones |
M |
-12.00000000000920 |
-12 |
-12 |
Jason |
Jaones |
M |
-12.00000000000091 |
-12 |
-12 |
Tyler |
ones |
M |
-5.99999999999999 |
-6 |
-5 |
Notice that in rows 1, 5 and 9 CEIL returns the integer portion of the SCORE since the return values were within 1E-12 of the original argument. However, for these same records CEILZ rounds up to the nearest smallest integer because there was zero fuzzing allowed. In the last row the argument was within 1E-12 of -6, so CEIL considers these equivalent and therefore returns the value of -6, but with CEILZ returned the smallest integer that was greater than the argument, which is -5.
Address matching is a task for which fuzzy matching techniques are frequently used. It is an example of “phrase matching”, where there are multiple words in a phrase that need to match in order for two phrases to be considered equal. Consider the table below, a partial printout of selected street types from SASHELP.GCTYPE, in which there are a number of variations in street types from across the world.
NAME |
TYPE |
GROUP |
AV |
AVE |
12 |
AVE |
AVE |
12 |
AVEN |
AVE |
12 |
AVENIDA |
AVE |
12 |
AVENU |
AVE |
12 |
AVENUE |
AVE |
12 |
AVN |
AVE |
12 |
AVNUE |
AVE |
12 |
BELT |
BELT |
16 |
BELTWAY |
BELT |
16 |
BL |
BLVD |
34 |
BLVD |
BLVD |
34 |
BOUL |
BLVD |
34 |
BOULEVARD |
BLVD |
34 |
BOULV |
BLVD |
34 |
BTWY |
BELT |
16 |
CIR |
CIR |
64 |
CIRC |
CIR |
64 |
CIRCL |
CIR |
64 |
CIRCLE |
CIR |
64 |
CIRCULO |
CIR |
64 |
CRCL |
CIR |
64 |
CRCLE |
CIR |
64 |
CÍR |
CIR |
64 |
CÍRCLE |
CIR |
64 |
As you can see, Avenue can be spelled a number of ways. SAS supplies this look-up table for PROC GEOCODE, discussed below. In a file or files of addresses, such variations in the street type spelling are just the tip of the iceberg in terms of the vast panoply of “dirty data”. SAS uses this look-up table, and others, in performing fuzzy matches for street addresses (and other geographic entities such as county, congressional districts, etc.) and produces standardized addresses. In addition, we’ll discuss another SAS tool for fuzzy address matching, creating our own fuzzy function to perform the normalizing of street types below.
SAMPLE DATA FOR PROC GEOCODE
PROVNUM |
ADDRESS |
CITY |
STATE |
ZIP |
105205 |
2121 E COMMERCIAL BLVD |
FORT LAUDERDALE |
FL |
33308 |
106088 |
4650 STATE RD 16 |
SAINT AUGUSTINE |
FL |
32092 |
146035 |
2259 EAST 1100TH STREET |
MENDON |
IL |
62351 |
175446 |
915 MCNAIR STREET |
HALSTEAD |
KS |
67056 |
175549 |
12340 QUIVIRA ROAD |
OVERLAND PARK |
KS |
66213 |
245395 |
965 MCMILLAN STREET |
WORTHINGTON |
MN |
56187 |
385263 |
970 W JUNIPER AVENUE |
HERMISTON |
OR |
97838 |
525362 |
719 E CATHERINE ST BOX 167 |
DARLINGTON |
WI |
53530 |
525462 |
245 SYCAMORE ST |
SAUK CITY |
WI |
53583 |
676397 |
23450 PINE SHADOW LN |
PORTER |
TX |
77365 |
PROC GEOCODE
method=STREET
data=prov
out=dd.GEOCODED
lookupstreet=street.usm
type=SASHELP.GCTYPE;
run;
The results from running the data through PROC GEOCODE show that the variations for ‘street’ were all converted to ‘St’ and the rows with ‘road’ were changed to ‘Rd’. For some addresses, there may be situations where there are two addresses tied to a particular location. GEOCODE will “normalize” the addresses to the actual physical addresses for that location found in the look up file which are used by the USPS.
Before:
OBS |
ADDRESS |
CITY |
STATE |
ZIP |
1 |
2121 E COMMERCIAL BLVD |
FORT LAUDERDALE |
FL |
33308 |
2 |
4650 STATE RD 16 |
SAINT AUGUSTINE |
FL |
32092 |
3 |
2259 EAST 1100TH STREET |
MENDON |
IL |
62351 |
4 |
915 MCNAIR STREET |
HALSTEAD |
KS |
67056 |
5 |
12340 QUIVIRA ROAD |
OVERLAND PARK |
KS |
66213 |
6 |
965 MCMILLAN STREET |
WORTHINGTON |
MN |
56187 |
7 |
970 W JUNIPER AVENUE |
HERMISTON |
OR |
97838 |
8 |
719 E CATHERINE ST BOX 167 |
DARLINGTON |
WI |
53530 |
9 |
245 SYCAMORE ST |
SAUK CITY |
WI |
53583 |
10 |
23450 PINE SHADOW LN |
PORTER |
TX |
77365
|
After:
OBS |
M_ADDR |
M_CITY |
M_STATE |
M_ZIP |
1 |
2121 E Commercial Blvd |
Fort Lauderdale |
FL |
33308 |
2 |
4650 State Rd 16 |
Green Cove Springs |
FL |
32092 |
3 |
2237 E 1100th St |
Mendon |
IL |
62351 |
4 |
915 McNair St |
Halstead |
KS |
67056 |
5 |
12340 Quivira Rd |
Overland Park |
KS |
66213 |
6 |
965 McMillan St |
Worthington |
MN |
56187 |
7 |
970 W Juniper Ave |
Hermiston |
OR |
97838 |
8 |
8374 Co Rd E |
Darlington |
WI |
53530 |
9 |
245 Sycamore St |
Sauk City |
WI |
53583 |
10 |
24200 Pine Cir |
Porter |
TX |
77365 |
Many of the fuzzy matching techniques discussed above are case sensitive – so that frequently variables representing patterns need to be standardized with regard to case and punctuation. A full discussion of PROC FCMP is beyond the scope of this paper, but we will briefly discuss a user-defined function that can be helpful when performing fuzzy matching (and elsewhere). Use of a format library entry to identify non-standard street name terminology is a helpful tool – and the format can “learn” by including new variations as they are found. One method of using the results of the learned translations is to write a function incorporating the translations, as well as standardizing case, etc. A simplistic example follows below, in which street types are standardized prior to going into a fuzzy matching routine. As with the format, the function can be informed by new variations uncovered. In addition, the function performs such tasks as standardizing case, left justifying, and trimming.
proc fcmp outlib=work.funcs.address;
function streets(addr $) $;
length clean_address standardized_address $100;
clean_address=upcase(addr);
clean_address=left(trim(clean_address));
clean_address=tranwrd(clean_address,' STREET ',' ST ');
clean_address=tranwrd(clean_address,'ROAD','RD');
clean_address=tranwrd(clean_address,'BOULEVARD','BLVD');
clean_address=tranwrd(clean_address,'AVENUE','AVE');
clean_address=tranwrd(clean_address,' DRIVE ',' DR ');
clean_address=tranwrd(clean_address,'PLACE','PL');
clean_address=tranwrd(clean_address,'LANE','LN');
clean_address=tranwrd(clean_address,'CIRCLE','CIR');
clean_address=tranwrd(clean_address,'COURT ','CT ');
clean_address=tranwrd(clean_address,'PARKWAY','PKWY');
standardized_address=clean_address;
return(standardized_address);
endsub;
quit;
SAMPLE DATA FOR PROC FCMP
FIRSTNAME |
LASTNAME |
ADDRESS |
Jan |
Write |
1234 Any Place, Anywhere, NC 12345 |
Lucy |
Smyth |
5673 MyBlock Drive, Myhome, TX 79732 |
Kris |
Johnson |
19752 Home Blvd, Home, MA 03321 |
Chris |
Jones |
98 NewTown Circle, Newtown, OK 31313 |
Tracey |
Smith |
1294-13 Johnson Lane, Nowhere, MN 23213 |
Once a user-built function is created the options CMPLIB needs to point to the location of where the user-built function resides. The program snippet below illustrates the use of the option as well as implementing the function. After the execution of the program the addresses are cleaned so that there is consistency.
options cmplib=(work.funcs);
data roster3;
set roster2;
cleaned_address = streets(address);
run;
FIRSTNAME |
LASTNAME |
CLEANED_ADDRESS |
Jan |
Write |
1234 ANY PL, ANYWHERE, NC 12345 |
Lucy |
Smyth |
5673 MYBLOCK DRIVE, MYHOME, TX 79732 |
Kris |
Johnson |
19752 HOME BLVD, HOME, MA 03321 |
Chris |
Jones |
98 NEWTOWN CIR, NEWTOWN, OK 31313 |
Tracey |
Smith |
1294-13 JOHNSON LN, NOWHERE, MN 23213 |
SAS has provided a myriad of tools to utilize for “fuzzy” matching. Selection of records with where statements (conditions and special operators) and if statements (:_ operator and functions); standardizing of records using fuzzy matching techniques including user defined formats, functions, and PROC GEOCODE (address information); and PROC FCMP (a user-defined function to clean addresses) are all discussed in the full paper attached. We hope you’ve gained some appreciation for the “fuzz” and you’ll get “fuzzy” along with us!
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Ready to level-up your skills? Choose your own adventure.