Hi, thank you in advance for your assistance. @ballardw and @Reeza have gotten me thus far using arrays but now I'm at a point that I need to merge two data sets with no clear key. So I don't know what to use for a by variable. Ideally, I would merge based upon owner_city (by the way, this variable exists in both data sets...), however due to unclean data within the primary data set, intl, the 'clean' owner_city by key could be contained within a number of candidate fields. Or, there could be no match at all.. Either way, I'm hoping to keep all records from intl, matches and non-matches. So for intl, I'm attempting to group these candidate fields within an array, entity_list. And then use a do loop to search for matches between owner_city from wrld_cities and the array entity_list from intl.
The code below is definitely not functional. I'm just showing it to hopefully convey what I'm trying to do... Any assistance would be greatly appreciated. 😊
Andy
</>
*----------------------------------------------------------------------------*;
* Merge Intl data set with Wrld Cities to bring in complete foreign address info *;
*----------------------------------------------------------------------------*;
data intl_adrs;
merge intl(in=i)
wrld_cities (in=w);
array entity_list {&num_elements_v2.} $60 owner_city words_parsed1-words_parsed&maxwords. words_parsedv21-words_parsedv2&maxwords.;
do i=1 to dim(entity_list);
if i.owner_city in (w.entity_list{i}) then output;
else if i output;
end;
run;
</>
You're very welcome.
I just got back from a hike, and I was reviewing the data vs. the results. The results are actually pretty darned good. (of course this is a small dataset and the real measure of efficacy will be garnered from a larger sample).
Some observations:
So, at least with this small set of data, the array/hash look up method is working surprisingly well -- when we have data. The Wrld_Cities file is missing a lot of data. To improve the match rate, the first thing we might want to do is to get more data put into the Wrld_Cities file.
Jim
Well, I'm no expert on fuzzy matching, but this article covers some of the basics. I think you have to clean up the keys first, then merge.
https://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/
Jim
@Reeza wrote:
then do a fizzy look up.
You buying? 😉 I'll drink to that. 🍻 Might help my programming actually.
Cheers,
Jim
Did you see what I posted below? Is that of any help? I was able to match 60% of the cities.
Jim
You know, just looking at the data, it's not that bad. I mean, you can match over half the file on city with just conventional means (no fuzzy match -- yet). The city file is very limited. If the city file had more data, I think we could match more.
With just conventional SAS array programming, I was able to get a match on 60% of the cities with just the limited city file that was posted. See code, below.
Now, there's a lot more to do here, but, shoot, use conventional means where you can, and then use fuzzy matching to increase your match rate beyond what conventional means can do.
I posted my results below the code. If I was able to get a traditional text match, then I put the matching city name into a column called Confirmed_City. I also set a Boolean indicator, City_Confirmed, which has a value of true (1) if I got a traditional text match and false (0) if I did not. By the way, I didn't consider the city a match unless the country code matched too.
Jim
Log:
NOTE: There were 12 observations read from the data set WORK.CITIES. NOTE: The import data set has 28 observations and 26 variables. ------------------------- NOTE: _Records_Read=28 _Records_Matched=17 _Percent_Match=60.71% ------------------------- NOTE: There were 28 observations read from the data set INTL.Sheet1. NOTE: The data set WORK.INTL has 28 observations and 26 variables.
SAS Code:
**------------------------------------------------------------------------------**;
LIBNAME Intl XLSX "&Path\SampleData\intl.xlsx";
LIBNAME Cities XLSX "&Path\SampleData\wrld_cities.xlsx";
**------------------------------------------------------------------------------**;
DATA WORK.Cities(RENAME=(Owner_Cntry_Abbrev=_Comp_Cntry));
SET Cities.Sheet1;
_comp_city = COMPRESS(owner_city,,'KA');
RUN;
**------------------------------------------------------------------------------**;
DATA WORK.Intl;
DROP _:;
IF _End_Of_Data THEN
DO;
_Percent_Match = (_Records_Matched / _Records_Read);
PUTLOG "NOTE- ";
PUTLOG "NOTE- -------------------------";
PUTLOG "NOTE: " _Records_Read= COMMA17.;
PUTLOG "NOTE- " _Records_Matched= COMMA17.;
PUTLOG "NOTE- " _Percent_Match= PERCENT8.2;
PUTLOG "NOTE- -------------------------";
PUTLOG "NOTE- ";
END;
SET Intl.Sheet1 (DROP=I _I)
END=_End_Of_Data;
ARRAY Search_Strings [*] $24 owner_city words_parsed1 -- words_parsedv23;
LENGTH Confirmed_City $24;
LENGTH _Comp_City $24;
LENGTH _Comp_Cntry $2;
IF _N_ = 1 THEN
DO;
CALL MISSING (_comp_city, _Comp_Cntry);
DECLARE HASH Hsh_Cities (DATASET: 'WORK.Cities');
Hsh_Cities.DefineKey ('_comp_city');
Hsh_Cities.DefineData ('_Comp_Cntry');
Hsh_Cities.DefineDone ();
END;
City_Confirmed = 0;
Confirmed_City = ' ';
_Records_Read + 1;
DO _i = 1 TO DIM(Search_Strings);
_Comp_City = COMPRESS(Search_Strings[_i],,'KA');
_RC = Hsh_Cities.FIND();
IF _RC = 0 AND
_Comp_Cntry = Owner_Cntry_Abbrev THEN
DO;
City_Confirmed = 1;
Confirmed_City = Search_Strings[_i];
_Records_Matched + 1;
_i = DIM(Search_Strings);
END;
END;
RUN;
**------------------------------------------------------------------------------**;
Partial Results (just the last set of columns):
You're very welcome.
I just got back from a hike, and I was reviewing the data vs. the results. The results are actually pretty darned good. (of course this is a small dataset and the real measure of efficacy will be garnered from a larger sample).
Some observations:
So, at least with this small set of data, the array/hash look up method is working surprisingly well -- when we have data. The Wrld_Cities file is missing a lot of data. To improve the match rate, the first thing we might want to do is to get more data put into the Wrld_Cities file.
Jim
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.