BookmarkSubscribeRSS Feed

Quick, Call the "FUZZ": Using Fuzzy Logic

Started ‎03-15-2021 by
Modified ‎04-16-2021 by
Views 1,276
Paper 1189-2021
Authors
 
 
  Richann Watson, DataRich Consulting; Louise Hadden, Abt Associates

Abstract

SAS® practitioners are frequently called upon to do a comparison of data between two different data sets and find that the values in synonymous fields do not line up exactly. A second quandary occurs when there is one data source to search for particular values, but those values are contained in character fields in which the values can be represented in myriad different ways. This presentation discusses robust, if not warm and fuzzy, techniques for comparing data between, and selecting data in, SAS data sets in not so ideal conditions. SAS has provided a number of tools which can perform fuzzy matching. Among these tools are wild card searches in a where statement using the LIKE (alias ?) and CONTAINS operators; string searches in an if statement using operators and functions; fuzzy comparison functions such as COMPGED and SPEDIS; PROC FCMP; PROC GEOCODE, and data driven control tables enabling user-defined formats to standardize data. The purpose of this presentation is to introduce, by example, each of these tools and techniques.

InTRODUCTIOn

SAS has provided a number of tools which can perform “fuzzy matching”. Among these tools are “wild card” searches in a where statement using the LIKE (alias ?) and CONTAINS operators; string searches in an if statement using operators and functions; “fuzzy” comparison functions such as COMPGED and SPEDIS; PROC FCMP; PROC GEOCODE, and data driven control tables enabling user-defined formats to standardize data. The purpose of this paper is to introduce, by example, each of these tools and techniques. We urge you to look at our paper for more complete information - this article will just contain "teasers".

Sample Data i

Below is a table containing sample data for most of our fuzzy function examples in the paper. In the sections below, we'll present a single example of selected tools and techniques. Please see our paper for more exhaustive details on everything you ever wanted to know about fuzzy functions!

 

FIRSTNAME

LASTNAME

GENDER

SCORE

AGE

EMAIL

Jan

Write

F

1.00000000000012

44.9999999999990

Jan_Write@mail.org

Lucy

Smyth

F

1.00000000000121

53.9383983572895

Lucy_Smyth@mail.org

Kris

Johnson

F

1.00324325660746

39.3566050650240

Kris_Johnson@mail.org

Chris

Jones

M

0.00000000000121

48.8268309377139

Chris_Jones@mail.org

Tracey

Smith

F

0.00000000000012

46.4499999999991

Tracey_Smith@mail.org

Tracy

Besley

M

0.00324325660746

47.4999999999990

Tracy_Besley@mail.org

Tracie

Smith-jones

F

-2.00000000000921

35.6659822039699

Tracie_Smith-jones@mail.org

Chrys

Jones-Wright

F

-2.00000000000092

34.1546885694730

Chrys_Jones-Wright@mail.org

Jon

Wright

M

1.00000000000038

46.8145106091718

Jon_Wright@mail.org

John

Hall

M

1.00000000000384

42.9949999999900

John_Hall@mail.org

Timothy

Bones

M

-12.00000000000920

48.3504449007529

Timothy_Bones@mail.org

Jason

Jaones

M

-12.00000000000091

48.7118412046543

Jason_Jaones@mail.org

Tyler

ones

M

-5.00413456081936

42.9499999999990

Tyler_ones@mail.org

 

WHERE STATEMENT WILD CARD TECHNIQUES

There are times when you may need to subset records in a data file but specifying the full search string may be an arduous and error prone process, or there may be multiple search strings that have a similar, identifiable pattern. You could default to the standard logic of VAR in (‘FULL VALUE1’ ‘FULL VALUE2’ etc.), or, you can use the following “fuzzy” techniques.

 

CONTAINS or ? Operators

If you are using a sub-setting WHERE statement you can use a question mark (?) or the word CONTAINS instead of an equal (=). The CONTAINS or ? will allow look for records that have values that contain what is specified.

 

Using the ROSTER data set shown above for people that have ‘Jones’ in their last name, the CONTAINS or ? operators can minimize the chance of error by searching for any record that contains the word ‘Jones’ in the variable LASTNAME.

 

data jones_1;	
    set roster;	
    where LASTNAME ? 'Jones';	
run;	

data jones_2;
    set roster;
    where LASTNAME contains 'Jones';
run;

Regardless of which operator is used they will both yield the same results.

 

FIRSTNAME

LASTNAME

GENDER

Chris

Jones

M

Chrys

Jones-Wright

F

 

It is important to note that the CONTAINS or ? operators are case sensitive, thus if casing should be ignored, then it should be used in conjunction with the LOWCASE function or UPCASE function to force the variable to be one case.

SEARching for a string with if statements

There are several different SAS functions-based options that can be explored when using IF statements for which wild card techniques are not applicable. These “fuzzy” search techniques are explored below.

 

INDEX Function

If you need to search for a string anywhere within a variable or another string, then the INDEX function could be utilized. With the INDEX function you would need to provide two arguments: the variable or string to be searched and the string that you are searching for.

 

Syntax: INDEX(source, excerpt)

 

data jones_8a;
    set roster;
    if index(LASTNAME, 'Jones');
run;

 

FIRSTNAME

LASTNAME

GENDER

Chris

Jones

M

Chrys

Jones-Wright

F


The display above shows a subset of the ROSTER data file containing last names of Jones using the INDEX function. The downside of using INDEX is that it is case sensitive. If you need to look for values regardless of case status, you could use the UPCASE function or LOWCASE function to force the source string to be one case and then specify the excerpt string to be the same case. The program and display below show that LASTNAME = ‘Smith-jones’ was missed in the initial program execution because ‘jones’ was not proper case.

 

data jones_8b;
    set roster;
    if index(upcase(LASTNAME),'JONES');
run;

 

FIRSTNAME

LASTNAME

GENDER

Chris

Jones

M

Tracie

Smith-jones

F

Chrys

Jones-Wright

F

 

CHARACTER fuzzy comparisons

 

Character strings, especially strings describing names and addresses, are notoriously dirty and prone to spacing, length and punctuation issues. Any real-world comparison of character strings or selection based on character strings needs to be both flexible and configurable, i.e. the degree of “sameness” needs to be quantifiable. SAS provides several character functions that allow you to make a fuzzy comparison: COMPARE, COMPGED, COMPLEV, SOUNDEX and SPEDIS. Each of these functions use a different fuzzy algorithm and can be used in conjunction with one another to achieve a (subjectively) optimal match. Use of these functions produces inexact results by definition, and results must be reviewed carefully. We discuss the SOUNDEX function below.

 

SOUNDEX Function

The SOUNDEX function determines how much two character variables sound alike. It works best with the English language. It is equivalent to using =* (sounds like) on a WHERE statement.

 

Syntax: SOUNDEX(argument)

 

With the SOUNDEX function vowels and the letters ‘H’, ‘W’ and ‘Y’ are excluded except when it is the first character in the argument when determining if the argument sounds like a specific value. Other characters in the English alphabet are assigned one of the following values:

  • B, F, P, V -) 1
  • C, G, J, K, Q, S, X, Z -) 2
  • D, T -) 3
  • L -) 4
  • M, N -) 5
  • R -) 6

The value generated from SOUNDEX is the first character in the argument and then for each character in the argument that is not excluded is assigned one of the values above. If there are two or more consecutive characters assigned the same numeric value, then only the first one is kept.

 

To demonstrate the use of SOUNDEX, we execute the following data step  so that we can calculate the value generated from SOUNDEX using FIRSTNAME. 

 

data roster_soundex;
    set roster;
    FN_SOUND = soundex(FIRSTNAME); 
run;

 

FIRSTNAME

LASTNAME

FN_SOUND

Jan

Write

J5

Lucy

Smyth

L2

Kris

Johnson

K62

Chris

Jones

C62

Tracey

Smith

T62

Tracy

Besley

T62

Tracie

Smith-jones

T62

Chrys

Jones-Wright

C62

Jon

Wright

J5

John

Hall

J5

Timothy

Bones

T53

Jason

Jaones

J25

Tyler

ones

T46

 

For rows 4 and 8, we see that both yield a value of ‘C62’. This is because both started with ‘C’ and the ‘h’, ‘y’ and ‘i’ were discarded, leaving only ‘r’ and ‘s’. The ‘r’ was assigned a value of ‘6’ and ‘s’ was assigned a value of ‘2’. Notice that if the third row would have started with a ‘C’ instead of a ‘K’ it would have resulted in the same value. However, since it is started with a ‘K’, the result was ‘K62’.

 

For the rows 5-7, we see that the value was ‘T62’ and this was because we discarded all the vowels and ‘y’ after the first character, leaving only ‘r’ and ‘c’.

 

For rows 9 and 10, the ‘o’ and ‘h’ were discarded leaving only the ‘n’ after the first argument, resulting in a value of ‘J5’.

 

NUMERIC FUZZY COMPARISONS

All the techniques illustrated thus far have been dealing with character strings, but what if we have a numeric value? There are various options available for determining if a numeric value is equivalent to another numeric value. In some cases, the values will be exactly equal, and no additional comparison is needed. However, there are some cases where the values are not quite equal but are equal ‘enough’ so that if the values were rounded or truncated or a ‘fuzz’ factor is added then the values would be considered equal. Depending on the type of ‘fuzz’ factor you wish to consider will determine which function should be best utilized. Using the ROSTER data in Data Display 1, we will illustrate several numeric ‘fuzzy’ functions.

 

CEIL and CEILZ Functions

The CEIL function rounds UP to the nearest smallest integer that is greater than or equal to the argument, that is it will return an integer value that is greater than or equal to the argument. It uses fuzzing in order to avoid issues with floating points. If the result returned from the CEIL function is with 1E-12 of the argument, then the value is considered equal to the integer portion of the argument.

 

Syntax: CEIL(argument)

However, if you do not want to consider any fuzzing when rounding up to the nearest integer, then CEILZ is the function that should be used. CEILZ works the same as CEIL but it does not use fuzzing. Therefore, even if the return value is within 1E-12 of the argument it will round up to the nearest smallest integer instead of considering the value equal to the integer portion of the argument.

 

Syntax: CEILZ(argument)

 

data fuzz_score;
     set roster;
     S_CEIL = ceil(SCORE); 
     S_CEILZ = ceilz(SCORE); 
run;

 

FIRSTNAME

LASTNAME

GENDER

SCORE

S_CEIL

S_CEILZ

Jan

Write

F

1.00000000000012

1

2

Lucy

Smyth

F

1.00000000000121

2

2

Kris

Johnson

F

1.00324325660746

2

2

Chris

Jones

M

0.00000000000121

1

1

Tracey

Smith

F

0.00000000000012

0

1

Tracy

Besley

M

0.00324325660746

1

1

Tracie

Smith-jones

F

-2.00000000000921

-2

-2

Chrys

Jones-Wright

F

-2.00000000000092

-2

-2

Jon

Wright

M

1.00000000000038

1

2

John

Hall

M

1.00000000000384

2

2

Timothy

Bones

M

-12.00000000000920

-12

-12

Jason

Jaones

M

-12.00000000000091

-12

-12

Tyler

ones

M

-5.99999999999999

-6

-5

 

Notice that in rows 1, 5 and 9 CEIL returns the integer portion of the SCORE since the return values were within 1E-12 of the original argument. However, for these same records CEILZ rounds up to the nearest smallest integer because there was zero fuzzing allowed. In the last row the argument was within 1E-12 of -6, so CEIL considers these equivalent and therefore returns the value of -6, but with CEILZ returned the smallest integer that was greater than the argument, which is -5.

ADDRESS CHECKING WITH SAS

Address matching is a task for which fuzzy matching techniques are frequently used. It is an example of “phrase matching”, where there are multiple words in a phrase that need to match in order for two phrases to be considered equal. Consider the table below, a partial printout of selected street types from SASHELP.GCTYPE, in which there are a number of variations in street types from across the world.

 

NAME

TYPE

GROUP

AV

AVE

12

AVE

AVE

12

AVEN

AVE

12

AVENIDA

AVE

12

AVENU

AVE

12

AVENUE

AVE

12

AVN

AVE

12

AVNUE

AVE

12

BELT

BELT

16

BELTWAY

BELT

16

BL

BLVD

34

BLVD

BLVD

34

BOUL

BLVD

34

BOULEVARD

BLVD

34

BOULV

BLVD

34

BTWY

BELT

16

CIR

CIR

64

CIRC

CIR

64

CIRCL

CIR

64

CIRCLE

CIR

64

CIRCULO

CIR

64

CRCL

CIR

64

CRCLE

CIR

64

CÍR

CIR

64

CÍRCLE

CIR

64

 

As you can see, Avenue can be spelled a number of ways. SAS supplies this look-up table for PROC GEOCODE, discussed below. In a file or files of addresses, such variations in the street type spelling are just the tip of the iceberg in terms of the vast panoply of “dirty data”. SAS uses this look-up table, and others, in performing fuzzy matches for street addresses (and other geographic entities such as county, congressional districts, etc.) and produces standardized addresses. In addition, we’ll discuss another SAS tool for fuzzy address matching, creating our own fuzzy function to perform the normalizing of street types below.

 

 

SAMPLE DATA FOR PROC GEOCODE

PROVNUM

ADDRESS

CITY

STATE

ZIP

105205

2121 E COMMERCIAL BLVD

FORT LAUDERDALE

FL

33308

106088

4650 STATE RD 16

SAINT AUGUSTINE

FL

32092

146035

2259 EAST 1100TH STREET

MENDON

IL

62351

175446

915 MCNAIR STREET

HALSTEAD

KS

67056

175549

12340 QUIVIRA ROAD

OVERLAND PARK

KS

66213

245395

965 MCMILLAN STREET

WORTHINGTON

MN

56187

385263

970 W JUNIPER AVENUE

HERMISTON

OR

97838

525362

719 E CATHERINE ST BOX 167

DARLINGTON

WI

53530

525462

245 SYCAMORE ST

SAUK CITY

WI

53583

676397

23450 PINE SHADOW LN

PORTER

TX

77365

 

PROC GEOCODE
    method=STREET
    data=prov
    out=dd.GEOCODED
    lookupstreet=street.usm
    type=SASHELP.GCTYPE; 
run;

 

The results from running the data through PROC GEOCODE show that the variations for ‘street’ were all converted to ‘St’ and the rows with ‘road’ were changed to ‘Rd’. For some addresses, there may be situations where there are two addresses tied to a particular location. GEOCODE will “normalize” the addresses to the actual physical addresses for that location found in the look up file which are used by the USPS.

 

Before:

 

OBS

ADDRESS

CITY

STATE

ZIP

1

2121 E COMMERCIAL BLVD

FORT LAUDERDALE

FL

33308

2

4650 STATE RD 16

SAINT AUGUSTINE

FL

32092

3

2259 EAST 1100TH STREET

MENDON

IL

62351

4

915 MCNAIR STREET

HALSTEAD

KS

67056

5

12340 QUIVIRA ROAD

OVERLAND PARK

KS

66213

6

965 MCMILLAN STREET

WORTHINGTON

MN

56187

7

970 W JUNIPER AVENUE

HERMISTON

OR

97838

8

719 E CATHERINE ST BOX 167

DARLINGTON

WI

53530

9

245 SYCAMORE ST

SAUK CITY

WI

53583

10

23450 PINE SHADOW LN

PORTER

TX

77365

 

 

 

 

After:

 

OBS

M_ADDR

M_CITY

M_STATE

M_ZIP

1

2121 E Commercial Blvd

Fort Lauderdale

FL

33308

2

4650 State Rd 16

Green Cove Springs

FL

32092

3

2237 E 1100th St

Mendon

IL

62351

4

915 McNair St

Halstead

KS

67056

5

12340 Quivira Rd

Overland Park

KS

66213

6

965 McMillan St

Worthington

MN

56187

7

970 W Juniper Ave

Hermiston

OR

97838

8

8374 Co Rd E

Darlington

WI

53530

9

245 Sycamore St

Sauk City

WI

53583

10

24200 Pine Cir

Porter

TX

77365

PROC FCMP - CREATE YOUR OWN FUZZ FUNCTION

 

Many of the fuzzy matching techniques discussed above are case sensitive – so that frequently variables representing patterns need to be standardized with regard to case and punctuation. A full discussion of PROC FCMP is beyond the scope of this paper, but we will briefly discuss a user-defined function that can be helpful when performing fuzzy matching (and elsewhere). Use of a format library entry to identify non-standard street name terminology is a helpful tool – and the format can “learn” by including new variations as they are found. One method of using the results of the learned translations is to write a function incorporating the translations, as well as standardizing case, etc. A simplistic example follows below, in which street types are standardized prior to going into a fuzzy matching routine. As with the format, the function can be informed by new variations uncovered. In addition, the function performs such tasks as standardizing case, left justifying, and trimming.

 

proc fcmp outlib=work.funcs.address;
function streets(addr $) $;
length clean_address standardized_address $100; 
clean_address=upcase(addr);
clean_address=left(trim(clean_address));
clean_address=tranwrd(clean_address,' STREET ',' ST ');
clean_address=tranwrd(clean_address,'ROAD','RD');
clean_address=tranwrd(clean_address,'BOULEVARD','BLVD');
clean_address=tranwrd(clean_address,'AVENUE','AVE');
clean_address=tranwrd(clean_address,' DRIVE ',' DR ');
clean_address=tranwrd(clean_address,'PLACE','PL');
clean_address=tranwrd(clean_address,'LANE','LN');
clean_address=tranwrd(clean_address,'CIRCLE','CIR');
clean_address=tranwrd(clean_address,'COURT ','CT ');
clean_address=tranwrd(clean_address,'PARKWAY','PKWY');
standardized_address=clean_address;
return(standardized_address); endsub; quit;

 

SAMPLE DATA FOR PROC FCMP

 

FIRSTNAME

LASTNAME

ADDRESS

Jan

Write

1234 Any Place, Anywhere, NC 12345

Lucy

Smyth

5673 MyBlock Drive, Myhome, TX 79732

Kris

Johnson

19752 Home Blvd, Home, MA 03321

Chris

Jones

98 NewTown Circle, Newtown, OK 31313

Tracey

Smith

1294-13 Johnson Lane, Nowhere, MN 23213

 

Once a user-built function is created the options CMPLIB needs to point to the location of where the user-built function resides. The program snippet below illustrates the use of the option as well as implementing the function. After the execution of the program the addresses are cleaned so that there is consistency.

 

options cmplib=(work.funcs);
data roster3;
    set roster2;
    cleaned_address = streets(address); 
run;

 

FIRSTNAME

LASTNAME

CLEANED_ADDRESS

Jan

Write

1234 ANY PL, ANYWHERE, NC 12345

Lucy

Smyth

5673 MYBLOCK DRIVE, MYHOME, TX 79732

Kris

Johnson

19752 HOME BLVD, HOME, MA 03321

Chris

Jones

98 NEWTOWN CIR, NEWTOWN, OK 31313

Tracey

Smith

1294-13 JOHNSON LN, NOWHERE, MN 23213

 

Conclusion

SAS has provided a myriad of tools to utilize for “fuzzy” matching. Selection of records with where statements (conditions and special operators) and if statements (:_ operator and functions); standardizing of records using fuzzy matching techniques including user defined formats, functions, and PROC GEOCODE (address information); and PROC FCMP (a user-defined function to clean addresses) are all discussed in the full paper attached. We hope you’ve gained some appreciation for the “fuzz” and you’ll get “fuzzy” along with us!

 

Version history
Last update:
‎04-16-2021 07:27 PM
Updated by:

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Article Labels
Article Tags