Solved: Re: "fuzzy" merge using an array?

agbpilot · Posted 10-02-2020 06:05 PM

Hi, thank you in advance for your assistance. @ballardw and @Reeza have gotten me thus far using arrays but now I'm at a point that I need to merge two data sets with no clear key. So I don't know what to use for a by variable. Ideally, I would merge based upon owner_city (by the way, this variable exists in both data sets...), however due to unclean data within the primary data set, intl, the 'clean' owner_city by key could be contained within a number of candidate fields. Or, there could be no match at all.. Either way, I'm hoping to keep all records from intl, matches and non-matches. So for intl, I'm attempting to group these candidate fields within an array, entity_list. And then use a do loop to search for matches between owner_city from wrld_cities and the array entity_list from intl.

The code below is definitely not functional. I'm just showing it to hopefully convey what I'm trying to do... Any assistance would be greatly appreciated. 😊

Andy

</>

*----------------------------------------------------------------------------*;
* Merge Intl data set with Wrld Cities to bring in complete foreign address info *;
*----------------------------------------------------------------------------*;

data intl_adrs;
merge intl(in=i)
wrld_cities (in=w);

array entity_list {&num_elements_v2.} $60 owner_city words_parsed1-words_parsed&maxwords. words_parsedv21-words_parsedv2&maxwords.;

do i=1 to dim(entity_list);
if i.owner_city in (w.entity_list{i}) then output;
else if i output;
end;
run;

</>

jimbarbour · Posted 10-03-2020 04:24 PM

@agbpilot,

You're very welcome.

I just got back from a hike, and I was reviewing the data vs. the results. The results are actually pretty darned good. (of course this is a small dataset and the real measure of efficacy will be garnered from a larger sample).

Some observations:

The Wrld_Cities file only has data for Puerto Rico and Japan, so no matter how fancy our logic is, we can only match country codes PR and JP for now.
In the Intl file, there are three records from PR, and all three were matched. That's 100% match. Can't get much better than that.
In the Intl file, there are 16 records from JP; 13 were matched. That's 81%. Not too shabby. The three that were not matched were not matched for the following reasons:
1. Two had only numbers with no alpha city. There's no way to text match on these. We have to take some other approach on these.
2. The third, Tokyo, had no corresponding record in the Wrld_Cities file.

So, at least with this small set of data, the array/hash look up method is working surprisingly well -- when we have data. The Wrld_Cities file is missing a lot of data. To improve the match rate, the first thing we might want to do is to get more data put into the Wrld_Cities file.

Jim

View solution in original post

jimbarbour · Posted 10-02-2020 06:44 PM

Well, I'm no expert on fuzzy matching, but this article covers some of the basics. I think you have to clean up the keys first, then merge.

https://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/

Jim

Reeza · Posted 10-02-2020 11:59 PM

Standardize the data first so you can at least join on country and then do a fizzy look up. Fuzzy matches are not easy. What output do you need and what level of accuracy do you need?

jimbarbour · Posted 10-03-2020 12:03 AM

@Reeza wrote:
then do a fizzy look up.

You buying? 😉 I'll drink to that. 🍻 Might help my programming actually.

Cheers,

Jim

agbpilot · Posted 10-03-2020 10:01 AM

Oh ok thank you Reeza. I really appreciate it. The wrld_cities dataset is
essentially a lookup table to clean the owner city and owner country fields
within the intl dataset. A value for the owner country field in the intl
dataset is not always present, or it can also be inaccurate. So my
original plan was to join on owner city to bring back the proper owner
country value from wrld_cities into the intl dataset. However, the owner
city field within the intl dataset is also not properly formatted, so it's
unreliable to use that as a join variable/key. And so that's why I created
the array. The array serves to create a set of elements from the original
owner city field, one or more of which could serve as a properly formatted
owner city key to join to the wrld_city dataset. The problem is I have to
iterate through those elements in the array to preform some sort of a
conditional join to properly connect to the wrld_cities lookup table.

Yes I'm definitely in over my head on this one... Any further suggestions
or guidance would be greatly appreciated. And thank you as always.

Andy

jimbarbour · Posted 10-03-2020 11:12 AM

@agbpilot,

Did you see what I posted below? Is that of any help? I was able to match 60% of the cities.

Jim

jimbarbour · Posted 10-03-2020 01:34 AM

You know, just looking at the data, it's not that bad. I mean, you can match over half the file on city with just conventional means (no fuzzy match -- yet). The city file is very limited. If the city file had more data, I think we could match more.

With just conventional SAS array programming, I was able to get a match on 60% of the cities with just the limited city file that was posted. See code, below.

Now, there's a lot more to do here, but, shoot, use conventional means where you can, and then use fuzzy matching to increase your match rate beyond what conventional means can do.

I posted my results below the code. If I was able to get a traditional text match, then I put the matching city name into a column called Confirmed_City. I also set a Boolean indicator, City_Confirmed, which has a value of true (1) if I got a traditional text match and false (0) if I did not. By the way, I didn't consider the city a match unless the country code matched too.

Jim

Log:

NOTE: There were 12 observations read from the data set WORK.CITIES.
NOTE: The import data set has 28 observations and 26 variables.
       
       -------------------------
NOTE:   _Records_Read=28
        _Records_Matched=17
        _Percent_Match=60.71%
       -------------------------
       
NOTE: There were 28 observations read from the data set INTL.Sheet1.
NOTE: The data set WORK.INTL has 28 observations and 26 variables.

SAS Code:

**------------------------------------------------------------------------------**;

LIBNAME	Intl	XLSX	"&Path\SampleData\intl.xlsx";
LIBNAME	Cities	XLSX	"&Path\SampleData\wrld_cities.xlsx";

**------------------------------------------------------------------------------**;

DATA	WORK.Cities(RENAME=(Owner_Cntry_Abbrev=_Comp_Cntry));
	SET	Cities.Sheet1;
	_comp_city	=	COMPRESS(owner_city,,'KA');
RUN;

**------------------------------------------------------------------------------**;

DATA	WORK.Intl;
	DROP	_:;

	IF	_End_Of_Data											THEN
		DO;
			_Percent_Match				=	(_Records_Matched / _Records_Read);
			PUTLOG  "NOTE-  ";
			PUTLOG  "NOTE-  -------------------------";
			PUTLOG  "NOTE:   "	_Records_Read=		COMMA17.;
			PUTLOG  "NOTE-   "	_Records_Matched=	COMMA17.;
			PUTLOG  "NOTE-   "	_Percent_Match=		PERCENT8.2;
			PUTLOG  "NOTE-  -------------------------";
			PUTLOG  "NOTE-  ";
		END;

	SET	Intl.Sheet1	(DROP=I _I)	
		END=_End_Of_Data;

	ARRAY	Search_Strings	[*]	$24	owner_city words_parsed1 -- words_parsedv23;
	LENGTH	Confirmed_City	$24;
	LENGTH	_Comp_City		$24;
	LENGTH	_Comp_Cntry		$2;

	IF	_N_								=	1					THEN
		DO;
			CALL	MISSING							(_comp_city, _Comp_Cntry);
			DECLARE	HASH	Hsh_Cities				(DATASET: 'WORK.Cities');
							Hsh_Cities.DefineKey 	('_comp_city');
							Hsh_Cities.DefineData	('_Comp_Cntry');
							Hsh_Cities.DefineDone	();
		END;

	City_Confirmed						=	0;
	Confirmed_City						=	' ';
	_Records_Read						+	1;

	DO	_i	=	1	TO	DIM(Search_Strings);
		_Comp_City						=	COMPRESS(Search_Strings[_i],,'KA');
		_RC								=	Hsh_Cities.FIND();
		IF	_RC							=	0					AND
			_Comp_Cntry					=	Owner_Cntry_Abbrev	THEN
			DO;
				City_Confirmed			=	1;
				Confirmed_City			=	Search_Strings[_i];
				_Records_Matched		+	1;
				_i						=	DIM(Search_Strings);
			END;
	END;
RUN;

**------------------------------------------------------------------------------**;

Partial Results (just the last set of columns):

agbpilot · Posted 10-03-2020 01:11 PM

Thank you Sir for this example solution. Very cool technique.. Let me attempt implementing and I'll get back to you. Thanks again!

jimbarbour · Posted 10-03-2020 04:24 PM

@agbpilot,

You're very welcome.

I just got back from a hike, and I was reviewing the data vs. the results. The results are actually pretty darned good. (of course this is a small dataset and the real measure of efficacy will be garnered from a larger sample).

Some observations:

The Wrld_Cities file only has data for Puerto Rico and Japan, so no matter how fancy our logic is, we can only match country codes PR and JP for now.
In the Intl file, there are three records from PR, and all three were matched. That's 100% match. Can't get much better than that.
In the Intl file, there are 16 records from JP; 13 were matched. That's 81%. Not too shabby. The three that were not matched were not matched for the following reasons:
1. Two had only numbers with no alpha city. There's no way to text match on these. We have to take some other approach on these.
2. The third, Tokyo, had no corresponding record in the Wrld_Cities file.

So, at least with this small set of data, the array/hash look up method is working surprisingly well -- when we have data. The Wrld_Cities file is missing a lot of data. To improve the match rate, the first thing we might want to do is to get more data put into the Wrld_Cities file.

Jim

Registration is open

SAS Training: Just a Click Away