Paper 1189-2021
Authors
Richann Watson, DataRich Consulting; Louise Hadden, Abt Associates
Abstract
SAS® practitioners are frequently called upon to do a comparison of data between two different data sets and find that the values in synonymous fields do not line up exactly. A second quandary occurs when there is one data source to search for particular values, but those values are contained in character fields in which the values can be represented in myriad different ways. This presentation discusses robust, if not warm and fuzzy, techniques for comparing data between, and selecting data in, SAS data sets in not so ideal conditions. SAS has provided a number of tools which can perform fuzzy matching. Among these tools are wild card searches in a where statement using the LIKE (alias ?) and CONTAINS operators; string searches in an if statement using operators and functions; fuzzy comparison functions such as COMPGED and SPEDIS; PROC FCMP; PROC GEOCODE, and data driven control tables enabling user-defined formats to standardize data. The purpose of this presentation is to introduce, by example, each of these tools and techniques.
InTRODUCTIOn
SAS has provided a number of tools which can perform “fuzzy matching”. Among these tools are “wild card” searches in a where statement using the LIKE (alias ?) and CONTAINS operators; string searches in an if statement using operators and functions; “fuzzy” comparison functions such as COMPGED and SPEDIS; PROC FCMP; PROC GEOCODE, and data driven control tables enabling user-defined formats to standardize data. The purpose of this paper is to introduce, by example, each of these tools and techniques. We urge you to look at our paper for more complete information - this article will just contain "teasers".
Sample Data i
Below is a table containing sample data for most of our fuzzy function examples in the paper. In the sections below, we'll present a single example of selected tools and techniques. Please see our paper for more exhaustive details on everything you ever wanted to know about fuzzy functions!
FIRSTNAME
LASTNAME
GENDER
SCORE
AGE
EMAIL
Jan
Write
F
1.00000000000012
44.9999999999990
Jan_Write@mail.org
Lucy
Smyth
F
1.00000000000121
53.9383983572895
Lucy_Smyth@mail.org
Kris
Johnson
F
1.00324325660746
39.3566050650240
Kris_Johnson@mail.org
Chris
Jones
M
0.00000000000121
48.8268309377139
Chris_Jones@mail.org
Tracey
Smith
F
0.00000000000012
46.4499999999991
Tracey_Smith@mail.org
Tracy
Besley
M
0.00324325660746
47.4999999999990
Tracy_Besley@mail.org
Tracie
Smith-jones
F
-2.00000000000921
35.6659822039699
Tracie_Smith-jones@mail.org
Chrys
Jones-Wright
F
-2.00000000000092
34.1546885694730
Chrys_Jones-Wright@mail.org
Jon
Wright
M
1.00000000000038
46.8145106091718
Jon_Wright@mail.org
John
Hall
M
1.00000000000384
42.9949999999900
John_Hall@mail.org
Timothy
Bones
M
-12.00000000000920
48.3504449007529
Timothy_Bones@mail.org
Jason
Jaones
M
-12.00000000000091
48.7118412046543
Jason_Jaones@mail.org
Tyler
ones
M
-5.00413456081936
42.9499999999990
Tyler_ones@mail.org
WHERE STATEMENT WILD CARD TECHNIQUES
There are times when you may need to subset records in a data file but specifying the full search string may be an arduous and error prone process, or there may be multiple search strings that have a similar, identifiable pattern. You could default to the standard logic of VAR in (‘FULL VALUE1’ ‘FULL VALUE2’ etc.), or, you can use the following “fuzzy” techniques.
CONTAINS or ? Operators
If you are using a sub-setting WHERE statement you can use a question mark (?) or the word CONTAINS instead of an equal (=). The CONTAINS or ? will allow look for records that have values that contain what is specified.
Using the ROSTER data set shown above for people that have ‘Jones’ in their last name, the CONTAINS or ? operators can minimize the chance of error by searching for any record that contains the word ‘Jones’ in the variable LASTNAME.
data jones_1;
set roster;
where LASTNAME ? 'Jones';
run;
data jones_2;
set roster;
where LASTNAME contains 'Jones';
run;
Regardless of which operator is used they will both yield the same results.
FIRSTNAME
LASTNAME
GENDER
Chris
Jones
M
Chrys
Jones-Wright
F
It is important to note that the CONTAINS or ? operators are case sensitive, thus if casing should be ignored, then it should be used in conjunction with the LOWCASE function or UPCASE function to force the variable to be one case.
SEARching for a string with if statements
There are several different SAS functions-based options that can be explored when using IF statements for which wild card techniques are not applicable. These “fuzzy” search techniques are explored below.
INDEX Function
If you need to search for a string anywhere within a variable or another string, then the INDEX function could be utilized. With the INDEX function you would need to provide two arguments: the variable or string to be searched and the string that you are searching for.
Syntax: INDEX(source, excerpt)
data jones_8a;
set roster;
if index(LASTNAME, 'Jones');
run;
FIRSTNAME
LASTNAME
GENDER
Chris
Jones
M
Chrys
Jones-Wright
F
The display above shows a subset of the ROSTER data file containing last names of Jones using the INDEX function. The downside of using INDEX is that it is case sensitive. If you need to look for values regardless of case status, you could use the UPCASE function or LOWCASE function to force the source string to be one case and then specify the excerpt string to be the same case. The program and display below show that LASTNAME = ‘Smith-jones’ was missed in the initial program execution because ‘jones’ was not proper case.
data jones_8b;
set roster;
if index(upcase(LASTNAME),'JONES');
run;
FIRSTNAME
LASTNAME
GENDER
Chris
Jones
M
Tracie
Smith-jones
F
Chrys
Jones-Wright
F
CHARACTER fuzzy comparisons
Character strings, especially strings describing names and addresses, are notoriously dirty and prone to spacing, length and punctuation issues. Any real-world comparison of character strings or selection based on character strings needs to be both flexible and configurable, i.e. the degree of “sameness” needs to be quantifiable. SAS provides several character functions that allow you to make a fuzzy comparison: COMPARE, COMPGED, COMPLEV, SOUNDEX and SPEDIS. Each of these functions use a different fuzzy algorithm and can be used in conjunction with one another to achieve a (subjectively) optimal match. Use of these functions produces inexact results by definition, and results must be reviewed carefully. We discuss the SOUNDEX function below.
SOUNDEX Function
The SOUNDEX function determines how much two character variables sound alike. It works best with the English language. It is equivalent to using =* (sounds like) on a WHERE statement.
Syntax: SOUNDEX(argument)
With the SOUNDEX function vowels and the letters ‘H’, ‘W’ and ‘Y’ are excluded except when it is the first character in the argument when determining if the argument sounds like a specific value. Other characters in the English alphabet are assigned one of the following values:
B, F, P, V -) 1
C, G, J, K, Q, S, X, Z -) 2
D, T -) 3
L -) 4
M, N -) 5
R -) 6
The value generated from SOUNDEX is the first character in the argument and then for each character in the argument that is not excluded is assigned one of the values above. If there are two or more consecutive characters assigned the same numeric value, then only the first one is kept.
To demonstrate the use of SOUNDEX, we execute the following data step so that we can calculate the value generated from SOUNDEX using FIRSTNAME.
data roster_soundex;
set roster;
FN_SOUND = soundex(FIRSTNAME);
run;
FIRSTNAME
LASTNAME
FN_SOUND
Jan
Write
J5
Lucy
Smyth
L2
Kris
Johnson
K62
Chris
Jones
C62
Tracey
Smith
T62
Tracy
Besley
T62
Tracie
Smith-jones
T62
Chrys
Jones-Wright
C62
Jon
Wright
J5
John
Hall
J5
Timothy
Bones
T53
Jason
Jaones
J25
Tyler
ones
T46
For rows 4 and 8, we see that both yield a value of ‘C62’. This is because both started with ‘C’ and the ‘h’, ‘y’ and ‘i’ were discarded, leaving only ‘r’ and ‘s’. The ‘r’ was assigned a value of ‘6’ and ‘s’ was assigned a value of ‘2’. Notice that if the third row would have started with a ‘C’ instead of a ‘K’ it would have resulted in the same value. However, since it is started with a ‘K’, the result was ‘K62’.
For the rows 5-7, we see that the value was ‘T62’ and this was because we discarded all the vowels and ‘y’ after the first character, leaving only ‘r’ and ‘c’.
For rows 9 and 10, the ‘o’ and ‘h’ were discarded leaving only the ‘n’ after the first argument, resulting in a value of ‘J5’.
NUMERIC FUZZY COMPARISONS
All the techniques illustrated thus far have been dealing with character strings, but what if we have a numeric value? There are various options available for determining if a numeric value is equivalent to another numeric value. In some cases, the values will be exactly equal, and no additional comparison is needed. However, there are some cases where the values are not quite equal but are equal ‘enough’ so that if the values were rounded or truncated or a ‘fuzz’ factor is added then the values would be considered equal. Depending on the type of ‘fuzz’ factor you wish to consider will determine which function should be best utilized. Using the ROSTER data in Data Display 1, we will illustrate several numeric ‘fuzzy’ functions.
CEIL and CEILZ Functions
The CEIL function rounds UP to the nearest smallest integer that is greater than or equal to the argument, that is it will return an integer value that is greater than or equal to the argument. It uses fuzzing in order to avoid issues with floating points. If the result returned from the CEIL function is with 1E-12 of the argument, then the value is considered equal to the integer portion of the argument.
Syntax: CEIL(argument)
However, if you do not want to consider any fuzzing when rounding up to the nearest integer, then CEILZ is the function that should be used. CEILZ works the same as CEIL but it does not use fuzzing. Therefore, even if the return value is within 1E-12 of the argument it will round up to the nearest smallest integer instead of considering the value equal to the integer portion of the argument.
Syntax: CEILZ(argument)
data fuzz_score;
set roster;
S_CEIL = ceil(SCORE);
S_CEILZ = ceilz(SCORE);
run;
FIRSTNAME
LASTNAME
GENDER
SCORE
S_CEIL
S_CEILZ
Jan
Write
F
1.00000000000012
1
2
Lucy
Smyth
F
1.00000000000121
2
2
Kris
Johnson
F
1.00324325660746
2
2
Chris
Jones
M
0.00000000000121
1
1
Tracey
Smith
F
0.00000000000012
0
1
Tracy
Besley
M
0.00324325660746
1
1
Tracie
Smith-jones
F
-2.00000000000921
-2
-2
Chrys
Jones-Wright
F
-2.00000000000092
-2
-2
Jon
Wright
M
1.00000000000038
1
2
John
Hall
M
1.00000000000384
2
2
Timothy
Bones
M
-12.00000000000920
-12
-12
Jason
Jaones
M
-12.00000000000091
-12
-12
Tyler
ones
M
-5.99999999999999
-6
-5
Notice that in rows 1, 5 and 9 CEIL returns the integer portion of the SCORE since the return values were within 1E-12 of the original argument. However, for these same records CEILZ rounds up to the nearest smallest integer because there was zero fuzzing allowed. In the last row the argument was within 1E-12 of -6, so CEIL considers these equivalent and therefore returns the value of -6, but with CEILZ returned the smallest integer that was greater than the argument, which is -5.
ADDRESS CHECKING WITH SAS
Address matching is a task for which fuzzy matching techniques are frequently used. It is an example of “phrase matching”, where there are multiple words in a phrase that need to match in order for two phrases to be considered equal. Consider the table below, a partial printout of selected street types from SASHELP.GCTYPE, in which there are a number of variations in street types from across the world.
NAME
TYPE
GROUP
AV
AVE
12
AVE
AVE
12
AVEN
AVE
12
AVENIDA
AVE
12
AVENU
AVE
12
AVENUE
AVE
12
AVN
AVE
12
AVNUE
AVE
12
BELT
BELT
16
BELTWAY
BELT
16
BL
BLVD
34
BLVD
BLVD
34
BOUL
BLVD
34
BOULEVARD
BLVD
34
BOULV
BLVD
34
BTWY
BELT
16
CIR
CIR
64
CIRC
CIR
64
CIRCL
CIR
64
CIRCLE
CIR
64
CIRCULO
CIR
64
CRCL
CIR
64
CRCLE
CIR
64
CÍR
CIR
64
CÍRCLE
CIR
64
As you can see, Avenue can be spelled a number of ways. SAS supplies this look-up table for PROC GEOCODE, discussed below. In a file or files of addresses, such variations in the street type spelling are just the tip of the iceberg in terms of the vast panoply of “dirty data”. SAS uses this look-up table, and others, in performing fuzzy matches for street addresses (and other geographic entities such as county, congressional districts, etc.) and produces standardized addresses. In addition, we’ll discuss another SAS tool for fuzzy address matching, creating our own fuzzy function to perform the normalizing of street types below.
SAMPLE DATA FOR PROC GEOCODE
PROVNUM
ADDRESS
CITY
STATE
ZIP
105205
2121 E COMMERCIAL BLVD
FORT LAUDERDALE
FL
33308
106088
4650 STATE RD 16
SAINT AUGUSTINE
FL
32092
146035
2259 EAST 1100TH STREET
MENDON
IL
62351
175446
915 MCNAIR STREET
HALSTEAD
KS
67056
175549
12340 QUIVIRA ROAD
OVERLAND PARK
KS
66213
245395
965 MCMILLAN STREET
WORTHINGTON
MN
56187
385263
970 W JUNIPER AVENUE
HERMISTON
OR
97838
525362
719 E CATHERINE ST BOX 167
DARLINGTON
WI
53530
525462
245 SYCAMORE ST
SAUK CITY
WI
53583
676397
23450 PINE SHADOW LN
PORTER
TX
77365
PROC GEOCODE
method=STREET
data=prov
out=dd.GEOCODED
lookupstreet=street.usm
type=SASHELP.GCTYPE;
run;
The results from running the data through PROC GEOCODE show that the variations for ‘street’ were all converted to ‘St’ and the rows with ‘road’ were changed to ‘Rd’. For some addresses, there may be situations where there are two addresses tied to a particular location. GEOCODE will “normalize” the addresses to the actual physical addresses for that location found in the look up file which are used by the USPS.
Before:
OBS
ADDRESS
CITY
STATE
ZIP
1
2121 E COMMERCIAL BLVD
FORT LAUDERDALE
FL
33308
2
4650 STATE RD 16
SAINT AUGUSTINE
FL
32092
3
2259 EAST 1100TH STREET
MENDON
IL
62351
4
915 MCNAIR STREET
HALSTEAD
KS
67056
5
12340 QUIVIRA ROAD
OVERLAND PARK
KS
66213
6
965 MCMILLAN STREET
WORTHINGTON
MN
56187
7
970 W JUNIPER AVENUE
HERMISTON
OR
97838
8
719 E CATHERINE ST BOX 167
DARLINGTON
WI
53530
9
245 SYCAMORE ST
SAUK CITY
WI
53583
10
23450 PINE SHADOW LN
PORTER
TX
77365
After:
OBS
M_ADDR
M_CITY
M_STATE
M_ZIP
1
2121 E Commercial Blvd
Fort Lauderdale
FL
33308
2
4650 State Rd 16
Green Cove Springs
FL
32092
3
2237 E 1100th St
Mendon
IL
62351
4
915 McNair St
Halstead
KS
67056
5
12340 Quivira Rd
Overland Park
KS
66213
6
965 McMillan St
Worthington
MN
56187
7
970 W Juniper Ave
Hermiston
OR
97838
8
8374 Co Rd E
Darlington
WI
53530
9
245 Sycamore St
Sauk City
WI
53583
10
24200 Pine Cir
Porter
TX
77365
PROC FCMP - CREATE YOUR OWN FUZZ FUNCTION
Many of the fuzzy matching techniques discussed above are case sensitive – so that frequently variables representing patterns need to be standardized with regard to case and punctuation. A full discussion of PROC FCMP is beyond the scope of this paper, but we will briefly discuss a user-defined function that can be helpful when performing fuzzy matching (and elsewhere). Use of a format library entry to identify non-standard street name terminology is a helpful tool – and the format can “learn” by including new variations as they are found. One method of using the results of the learned translations is to write a function incorporating the translations, as well as standardizing case, etc. A simplistic example follows below, in which street types are standardized prior to going into a fuzzy matching routine. As with the format, the function can be informed by new variations uncovered. In addition, the function performs such tasks as standardizing case, left justifying, and trimming.
proc fcmp outlib=work.funcs.address;
function streets(addr $) $;
length clean_address standardized_address $100; clean_address=upcase(addr); clean_address=left(trim(clean_address)); clean_address=tranwrd(clean_address,' STREET ',' ST '); clean_address=tranwrd(clean_address,'ROAD','RD'); clean_address=tranwrd(clean_address,'BOULEVARD','BLVD'); clean_address=tranwrd(clean_address,'AVENUE','AVE'); clean_address=tranwrd(clean_address,' DRIVE ',' DR '); clean_address=tranwrd(clean_address,'PLACE','PL'); clean_address=tranwrd(clean_address,'LANE','LN'); clean_address=tranwrd(clean_address,'CIRCLE','CIR'); clean_address=tranwrd(clean_address,'COURT ','CT '); clean_address=tranwrd(clean_address,'PARKWAY','PKWY'); standardized_address=clean_address; return(standardized_address);
endsub;
quit;
SAMPLE DATA FOR PROC FCMP
FIRSTNAME
LASTNAME
ADDRESS
Jan
Write
1234 Any Place, Anywhere, NC 12345
Lucy
Smyth
5673 MyBlock Drive, Myhome, TX 79732
Kris
Johnson
19752 Home Blvd, Home, MA 03321
Chris
Jones
98 NewTown Circle, Newtown, OK 31313
Tracey
Smith
1294-13 Johnson Lane, Nowhere, MN 23213
Once a user-built function is created the options CMPLIB needs to point to the location of where the user-built function resides. The program snippet below illustrates the use of the option as well as implementing the function. After the execution of the program the addresses are cleaned so that there is consistency.
options cmplib=(work.funcs);
data roster3;
set roster2;
cleaned_address = streets(address);
run;
FIRSTNAME
LASTNAME
CLEANED_ADDRESS
Jan
Write
1234 ANY PL, ANYWHERE, NC 12345
Lucy
Smyth
5673 MYBLOCK DRIVE, MYHOME, TX 79732
Kris
Johnson
19752 HOME BLVD, HOME, MA 03321
Chris
Jones
98 NEWTOWN CIR, NEWTOWN, OK 31313
Tracey
Smith
1294-13 JOHNSON LN, NOWHERE, MN 23213
Conclusion
SAS has provided a myriad of tools to utilize for “fuzzy” matching. Selection of records with where statements (conditions and special operators) and if statements (:_ operator and functions); standardizing of records using fuzzy matching techniques including user defined formats, functions, and PROC GEOCODE (address information); and PROC FCMP (a user-defined function to clean addresses) are all discussed in the full paper attached. We hope you’ve gained some appreciation for the “fuzz” and you’ll get “fuzzy” along with us!
... View more