Quick, Call the "FUZZ": Using Fuzzy Logic

1 Like

Paper 1189-2021

Authors

Richann Watson, DataRich Consulting; Louise Hadden, Abt Associates

Abstract

SAS® practitioners are frequently called upon to do a comparison of data between two different data sets and find that the values in synonymous fields do not line up exactly. A second quandary occurs when there is one data source to search for particular values, but those values are contained in character fields in which the values can be represented in myriad different ways. This presentation discusses robust, if not warm and fuzzy, techniques for comparing data between, and selecting data in, SAS data sets in not so ideal conditions. SAS has provided a number of tools which can perform fuzzy matching. Among these tools are wild card searches in a where statement using the LIKE (alias ?) and CONTAINS operators; string searches in an if statement using operators and functions; fuzzy comparison functions such as COMPGED and SPEDIS; PROC FCMP; PROC GEOCODE, and data driven control tables enabling user-defined formats to standardize data. The purpose of this presentation is to introduce, by example, each of these tools and techniques.

InTRODUCTIOn

SAS has provided a number of tools which can perform “fuzzy matching”. Among these tools are “wild card” searches in a where statement using the LIKE (alias ?) and CONTAINS operators; string searches in an if statement using operators and functions; “fuzzy” comparison functions such as COMPGED and SPEDIS; PROC FCMP; PROC GEOCODE, and data driven control tables enabling user-defined formats to standardize data. The purpose of this paper is to introduce, by example, each of these tools and techniques. We urge you to look at our paper for more complete information - this article will just contain "teasers".

Sample Data i

Below is a table containing sample data for most of our fuzzy function examples in the paper. In the sections below, we'll present a single example of selected tools and techniques. Please see our paper for more exhaustive details on everything you ever wanted to know about fuzzy functions!

FIRSTNAME	LASTNAME	GENDER	SCORE	AGE	EMAIL
Jan	Write	F	1.00000000000012	44.9999999999990	Jan_Write@mail.org
Lucy	Smyth	F	1.00000000000121	53.9383983572895	Lucy_Smyth@mail.org
Kris	Johnson	F	1.00324325660746	39.3566050650240	Kris_Johnson@mail.org
Chris	Jones	M	0.00000000000121	48.8268309377139	Chris_Jones@mail.org
Tracey	Smith	F	0.00000000000012	46.4499999999991	Tracey_Smith@mail.org
Tracy	Besley	M	0.00324325660746	47.4999999999990	Tracy_Besley@mail.org
Tracie	Smith-jones	F	-2.00000000000921	35.6659822039699	Tracie_Smith-jones@mail.org
Chrys	Jones-Wright	F	-2.00000000000092	34.1546885694730	Chrys_Jones-Wright@mail.org
Jon	Wright	M	1.00000000000038	46.8145106091718	Jon_Wright@mail.org
John	Hall	M	1.00000000000384	42.9949999999900	John_Hall@mail.org
Timothy	Bones	M	-12.00000000000920	48.3504449007529	Timothy_Bones@mail.org
Jason	Jaones	M	-12.00000000000091	48.7118412046543	Jason_Jaones@mail.org
Tyler	ones	M	-5.00413456081936	42.9499999999990	Tyler_ones@mail.org

WHERE STATEMENT WILD CARD TECHNIQUES

There are times when you may need to subset records in a data file but specifying the full search string may be an arduous and error prone process, or there may be multiple search strings that have a similar, identifiable pattern. You could default to the standard logic of VAR in (‘FULL VALUE1’ ‘FULL VALUE2’ etc.), or, you can use the following “fuzzy” techniques.

CONTAINS or ? Operators

If you are using a sub-setting WHERE statement you can use a question mark (?) or the word CONTAINS instead of an equal (=). The CONTAINS or ? will allow look for records that have values that contain what is specified.

Using the ROSTER data set shown above for people that have ‘Jones’ in their last name, the CONTAINS or ? operators can minimize the chance of error by searching for any record that contains the word ‘Jones’ in the variable LASTNAME.

data jones_1;	
    set roster;	
    where LASTNAME ? 'Jones';	
run;	

data jones_2;
    set roster;
    where LASTNAME contains 'Jones';
run;

Regardless of which operator is used they will both yield the same results.

FIRSTNAME	LASTNAME	GENDER
Chris	Jones	M
Chrys	Jones-Wright	F

It is important to note that the CONTAINS or ? operators are case sensitive, thus if casing should be ignored, then it should be used in conjunction with the LOWCASE function or UPCASE function to force the variable to be one case.

SEARching for a string with if statements

There are several different SAS functions-based options that can be explored when using IF statements for which wild card techniques are not applicable. These “fuzzy” search techniques are explored below.

INDEX Function

If you need to search for a string anywhere within a variable or another string, then the INDEX function could be utilized. With the INDEX function you would need to provide two arguments: the variable or string to be searched and the string that you are searching for.

Syntax: INDEX(source, excerpt)

data jones_8a;
    set roster;
    if index(LASTNAME, 'Jones');
run;

FIRSTNAME	LASTNAME	GENDER
Chris	Jones	M
Chrys	Jones-Wright	F

The display above shows a subset of the ROSTER data file containing last names of Jones using the INDEX function. The downside of using INDEX is that it is case sensitive. If you need to look for values regardless of case status, you could use the UPCASE function or LOWCASE function to force the source string to be one case and then specify the excerpt string to be the same case. The program and display below show that LASTNAME = ‘Smith-jones’ was missed in the initial program execution because ‘jones’ was not proper case.

data jones_8b;
    set roster;
    if index(upcase(LASTNAME),'JONES');
run;

FIRSTNAME	LASTNAME	GENDER
Chris	Jones	M
Tracie	Smith-jones	F
Chrys	Jones-Wright	F

CHARACTER fuzzy comparisons

Character strings, especially strings describing names and addresses, are notoriously dirty and prone to spacing, length and punctuation issues. Any real-world comparison of character strings or selection based on character strings needs to be both flexible and configurable, i.e. the degree of “sameness” needs to be quantifiable. SAS provides several character functions that allow you to make a fuzzy comparison: COMPARE, COMPGED, COMPLEV, SOUNDEX and SPEDIS. Each of these functions use a different fuzzy algorithm and can be used in conjunction with one another to achieve a (subjectively) optimal match. Use of these functions produces inexact results by definition, and results must be reviewed carefully. We discuss the SOUNDEX function below.

SOUNDEX Function

The SOUNDEX function determines how much two character variables sound alike. It works best with the English language. It is equivalent to using =* (sounds like) on a WHERE statement.

Syntax: SOUNDEX(argument)

With the SOUNDEX function vowels and the letters ‘H’, ‘W’ and ‘Y’ are excluded except when it is the first character in the argument when determining if the argument sounds like a specific value. Other characters in the English alphabet are assigned one of the following values:

B, F, P, V -) 1
C, G, J, K, Q, S, X, Z -) 2
D, T -) 3
L -) 4
M, N -) 5
R -) 6

The value generated from SOUNDEX is the first character in the argument and then for each character in the argument that is not excluded is assigned one of the values above. If there are two or more consecutive characters assigned the same numeric value, then only the first one is kept.

To demonstrate the use of SOUNDEX, we execute the following data step so that we can calculate the value generated from SOUNDEX using FIRSTNAME.

data roster_soundex;
    set roster;
    FN_SOUND = soundex(FIRSTNAME); 
run;

FIRSTNAME	LASTNAME	FN_SOUND
Jan	Write	J5
Lucy	Smyth	L2
Kris	Johnson	K62
Chris	Jones	C62
Tracey	Smith	T62
Tracy	Besley	T62
Tracie	Smith-jones	T62
Chrys	Jones-Wright	C62
Jon	Wright	J5
John	Hall	J5
Timothy	Bones	T53
Jason	Jaones	J25
Tyler	ones	T46

For rows 4 and 8, we see that both yield a value of ‘C62’. This is because both started with ‘C’ and the ‘h’, ‘y’ and ‘i’ were discarded, leaving only ‘r’ and ‘s’. The ‘r’ was assigned a value of ‘6’ and ‘s’ was assigned a value of ‘2’. Notice that if the third row would have started with a ‘C’ instead of a ‘K’ it would have resulted in the same value. However, since it is started with a ‘K’, the result was ‘K62’.

For the rows 5-7, we see that the value was ‘T62’ and this was because we discarded all the vowels and ‘y’ after the first character, leaving only ‘r’ and ‘c’.

For rows 9 and 10, the ‘o’ and ‘h’ were discarded leaving only the ‘n’ after the first argument, resulting in a value of ‘J5’.

NUMERIC FUZZY COMPARISONS

All the techniques illustrated thus far have been dealing with character strings, but what if we have a numeric value? There are various options available for determining if a numeric value is equivalent to another numeric value. In some cases, the values will be exactly equal, and no additional comparison is needed. However, there are some cases where the values are not quite equal but are equal ‘enough’ so that if the values were rounded or truncated or a ‘fuzz’ factor is added then the values would be considered equal. Depending on the type of ‘fuzz’ factor you wish to consider will determine which function should be best utilized. Using the ROSTER data in Data Display 1, we will illustrate several numeric ‘fuzzy’ functions.

CEIL and CEILZ Functions

The CEIL function rounds UP to the nearest smallest integer that is greater than or equal to the argument, that is it will return an integer value that is greater than or equal to the argument. It uses fuzzing in order to avoid issues with floating points. If the result returned from the CEIL function is with 1E-12 of the argument, then the value is considered equal to the integer portion of the argument.

Syntax: CEIL(argument)

However, if you do not want to consider any fuzzing when rounding up to the nearest integer, then CEILZ is the function that should be used. CEILZ works the same as CEIL but it does not use fuzzing. Therefore, even if the return value is within 1E-12 of the argument it will round up to the nearest smallest integer instead of considering the value equal to the integer portion of the argument.

Syntax: CEILZ(argument)

data fuzz_score;
     set roster;
     S_CEIL = ceil(SCORE); 
     S_CEILZ = ceilz(SCORE); 
run;

FIRSTNAME	LASTNAME	GENDER	SCORE	S_CEIL	S_CEILZ
Jan	Write	F	1.00000000000012	1	2
Lucy	Smyth	F	1.00000000000121	2	2
Kris	Johnson	F	1.00324325660746	2	2
Chris	Jones	M	0.00000000000121	1	1
Tracey	Smith	F	0.00000000000012	0	1
Tracy	Besley	M	0.00324325660746	1	1
Tracie	Smith-jones	F	-2.00000000000921	-2	-2
Chrys	Jones-Wright	F	-2.00000000000092	-2	-2
Jon	Wright	M	1.00000000000038	1	2
John	Hall	M	1.00000000000384	2	2
Timothy	Bones	M	-12.00000000000920	-12	-12
Jason	Jaones	M	-12.00000000000091	-12	-12
Tyler	ones	M	-5.99999999999999	-6	-5

Notice that in rows 1, 5 and 9 CEIL returns the integer portion of the SCORE since the return values were within 1E-12 of the original argument. However, for these same records CEILZ rounds up to the nearest smallest integer because there was zero fuzzing allowed. In the last row the argument was within 1E-12 of -6, so CEIL considers these equivalent and therefore returns the value of -6, but with CEILZ returned the smallest integer that was greater than the argument, which is -5.

ADDRESS CHECKING WITH SAS

Address matching is a task for which fuzzy matching techniques are frequently used. It is an example of “phrase matching”, where there are multiple words in a phrase that need to match in order for two phrases to be considered equal. Consider the table below, a partial printout of selected street types from SASHELP.GCTYPE, in which there are a number of variations in street types from across the world.

NAME	TYPE	GROUP
AV	AVE	12
AVE	AVE	12
AVEN	AVE	12
AVENIDA	AVE	12
AVENU	AVE	12
AVENUE	AVE	12
AVN	AVE	12
AVNUE	AVE	12
BELT	BELT	16
BELTWAY	BELT	16
BL	BLVD	34
BLVD	BLVD	34
BOUL	BLVD	34
BOULEVARD	BLVD	34
BOULV	BLVD	34
BTWY	BELT	16
CIR	CIR	64
CIRC	CIR	64
CIRCL	CIR	64
CIRCLE	CIR	64
CIRCULO	CIR	64
CRCL	CIR	64
CRCLE	CIR	64
CÍR	CIR	64
CÍRCLE	CIR	64

As you can see, Avenue can be spelled a number of ways. SAS supplies this look-up table for PROC GEOCODE, discussed below. In a file or files of addresses, such variations in the street type spelling are just the tip of the iceberg in terms of the vast panoply of “dirty data”. SAS uses this look-up table, and others, in performing fuzzy matches for street addresses (and other geographic entities such as county, congressional districts, etc.) and produces standardized addresses. In addition, we’ll discuss another SAS tool for fuzzy address matching, creating our own fuzzy function to perform the normalizing of street types below.

SAMPLE DATA FOR PROC GEOCODE

PROVNUM	ADDRESS	CITY	STATE	ZIP
105205	2121 E COMMERCIAL BLVD	FORT LAUDERDALE	FL	33308
106088	4650 STATE RD 16	SAINT AUGUSTINE	FL	32092
146035	2259 EAST 1100TH STREET	MENDON	IL	62351
175446	915 MCNAIR STREET	HALSTEAD	KS	67056
175549	12340 QUIVIRA ROAD	OVERLAND PARK	KS	66213
245395	965 MCMILLAN STREET	WORTHINGTON	MN	56187
385263	970 W JUNIPER AVENUE	HERMISTON	OR	97838
525362	719 E CATHERINE ST BOX 167	DARLINGTON	WI	53530
525462	245 SYCAMORE ST	SAUK CITY	WI	53583
676397	23450 PINE SHADOW LN	PORTER	TX	77365

PROC GEOCODE
    method=STREET
    data=prov
    out=dd.GEOCODED
    lookupstreet=street.usm
    type=SASHELP.GCTYPE; 
run;

The results from running the data through PROC GEOCODE show that the variations for ‘street’ were all converted to ‘St’ and the rows with ‘road’ were changed to ‘Rd’. For some addresses, there may be situations where there are two addresses tied to a particular location. GEOCODE will “normalize” the addresses to the actual physical addresses for that location found in the look up file which are used by the USPS.

Before:

OBS	ADDRESS	CITY	STATE	ZIP
1	2121 E COMMERCIAL BLVD	FORT LAUDERDALE	FL	33308
2	4650 STATE RD 16	SAINT AUGUSTINE	FL	32092
3	2259 EAST 1100TH STREET	MENDON	IL	62351
4	915 MCNAIR STREET	HALSTEAD	KS	67056
5	12340 QUIVIRA ROAD	OVERLAND PARK	KS	66213
6	965 MCMILLAN STREET	WORTHINGTON	MN	56187
7	970 W JUNIPER AVENUE	HERMISTON	OR	97838
8	719 E CATHERINE ST BOX 167	DARLINGTON	WI	53530
9	245 SYCAMORE ST	SAUK CITY	WI	53583
10	23450 PINE SHADOW LN	PORTER	TX	77365

After:

OBS	M_ADDR	M_CITY	M_STATE	M_ZIP
1	2121 E Commercial Blvd	Fort Lauderdale	FL	33308
2	4650 State Rd 16	Green Cove Springs	FL	32092
3	2237 E 1100th St	Mendon	IL	62351
4	915 McNair St	Halstead	KS	67056
5	12340 Quivira Rd	Overland Park	KS	66213
6	965 McMillan St	Worthington	MN	56187
7	970 W Juniper Ave	Hermiston	OR	97838
8	8374 Co Rd E	Darlington	WI	53530
9	245 Sycamore St	Sauk City	WI	53583
10	24200 Pine Cir	Porter	TX	77365

PROC FCMP - CREATE YOUR OWN FUZZ FUNCTION

Many of the fuzzy matching techniques discussed above are case sensitive – so that frequently variables representing patterns need to be standardized with regard to case and punctuation. A full discussion of PROC FCMP is beyond the scope of this paper, but we will briefly discuss a user-defined function that can be helpful when performing fuzzy matching (and elsewhere). Use of a format library entry to identify non-standard street name terminology is a helpful tool – and the format can “learn” by including new variations as they are found. One method of using the results of the learned translations is to write a function incorporating the translations, as well as standardizing case, etc. A simplistic example follows below, in which street types are standardized prior to going into a fuzzy matching routine. As with the format, the function can be informed by new variations uncovered. In addition, the function performs such tasks as standardizing case, left justifying, and trimming.

proc fcmp outlib=work.funcs.address;
function streets(addr $) $;
length clean_address standardized_address $100; 
clean_address=upcase(addr); 
clean_address=left(trim(clean_address)); 
clean_address=tranwrd(clean_address,' STREET ',' ST '); 
clean_address=tranwrd(clean_address,'ROAD','RD'); 
clean_address=tranwrd(clean_address,'BOULEVARD','BLVD'); 
clean_address=tranwrd(clean_address,'AVENUE','AVE'); 
clean_address=tranwrd(clean_address,' DRIVE ',' DR '); 
clean_address=tranwrd(clean_address,'PLACE','PL'); 
clean_address=tranwrd(clean_address,'LANE','LN'); 
clean_address=tranwrd(clean_address,'CIRCLE','CIR'); 
clean_address=tranwrd(clean_address,'COURT ','CT '); 
clean_address=tranwrd(clean_address,'PARKWAY','PKWY'); 
standardized_address=clean_address; 
return(standardized_address);
endsub;
quit;

SAMPLE DATA FOR PROC FCMP

FIRSTNAME	LASTNAME	ADDRESS
Jan	Write	1234 Any Place, Anywhere, NC 12345
Lucy	Smyth	5673 MyBlock Drive, Myhome, TX 79732
Kris	Johnson	19752 Home Blvd, Home, MA 03321
Chris	Jones	98 NewTown Circle, Newtown, OK 31313
Tracey	Smith	1294-13 Johnson Lane, Nowhere, MN 23213

Once a user-built function is created the options CMPLIB needs to point to the location of where the user-built function resides. The program snippet below illustrates the use of the option as well as implementing the function. After the execution of the program the addresses are cleaned so that there is consistency.

options cmplib=(work.funcs);
data roster3;
    set roster2;
    cleaned_address = streets(address); 
run;

FIRSTNAME	LASTNAME	CLEANED_ADDRESS
Jan	Write	1234 ANY PL, ANYWHERE, NC 12345
Lucy	Smyth	5673 MYBLOCK DRIVE, MYHOME, TX 79732
Kris	Johnson	19752 HOME BLVD, HOME, MA 03321
Chris	Jones	98 NEWTOWN CIR, NEWTOWN, OK 31313
Tracey	Smith	1294-13 JOHNSON LN, NOWHERE, MN 23213

Conclusion

SAS has provided a myriad of tools to utilize for “fuzzy” matching. Selection of records with where statements (conditions and special operators) and if statements (:_ operator and functions); standardizing of records using fuzzy matching techniques including user defined formats, functions, and PROC GEOCODE (address information); and PROC FCMP (a user-defined function to clean addresses) are all discussed in the full paper attached. We hope you’ve gained some appreciation for the “fuzz” and you’ll get “fuzzy” along with us!