Hi, thank you in advance for your assistance. @ballardw and @Reeza have gotten me thus far using arrays but now I'm at a point that I need to merge two data sets with no clear key. So I don't know what to use for a by variable. Ideally, I would merge based upon owner_city (by the way, this variable exists in both data sets...), however due to unclean data within the primary data set, intl, the 'clean' owner_city by key could be contained within a number of candidate fields. Or, there could be no match at all.. Either way, I'm hoping to keep all records from intl, matches and non-matches. So for intl, I'm attempting to group these candidate fields within an array, entity_list. And then use a do loop to search for matches between owner_city from wrld_cities and the array entity_list from intl.
The code below is definitely not functional. I'm just showing it to hopefully convey what I'm trying to do... Any assistance would be greatly appreciated. 😊
Andy
</>
*----------------------------------------------------------------------------*;
* Merge Intl data set with Wrld Cities to bring in complete foreign address info *;
*----------------------------------------------------------------------------*;
data intl_adrs;
merge intl(in=i)
wrld_cities (in=w);
array entity_list {&num_elements_v2.} $60 owner_city words_parsed1-words_parsed&maxwords. words_parsedv21-words_parsedv2&maxwords.;
do i=1 to dim(entity_list);
if i.owner_city in (w.entity_list{i}) then output;
else if i output;
end;
run;
</>
You're very welcome.
I just got back from a hike, and I was reviewing the data vs. the results. The results are actually pretty darned good. (of course this is a small dataset and the real measure of efficacy will be garnered from a larger sample).
Some observations:
So, at least with this small set of data, the array/hash look up method is working surprisingly well -- when we have data. The Wrld_Cities file is missing a lot of data. To improve the match rate, the first thing we might want to do is to get more data put into the Wrld_Cities file.
Jim
Well, I'm no expert on fuzzy matching, but this article covers some of the basics. I think you have to clean up the keys first, then merge.
https://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/
Jim
@Reeza wrote:
then do a fizzy look up.
You buying? 😉 I'll drink to that. 🍻 Might help my programming actually.
Cheers,
Jim
Did you see what I posted below? Is that of any help? I was able to match 60% of the cities.
Jim
You know, just looking at the data, it's not that bad. I mean, you can match over half the file on city with just conventional means (no fuzzy match -- yet). The city file is very limited. If the city file had more data, I think we could match more.
With just conventional SAS array programming, I was able to get a match on 60% of the cities with just the limited city file that was posted. See code, below.
Now, there's a lot more to do here, but, shoot, use conventional means where you can, and then use fuzzy matching to increase your match rate beyond what conventional means can do.
I posted my results below the code. If I was able to get a traditional text match, then I put the matching city name into a column called Confirmed_City. I also set a Boolean indicator, City_Confirmed, which has a value of true (1) if I got a traditional text match and false (0) if I did not. By the way, I didn't consider the city a match unless the country code matched too.
Jim
Log:
NOTE: There were 12 observations read from the data set WORK.CITIES.
NOTE: The import data set has 28 observations and 26 variables.
-------------------------
NOTE: _Records_Read=28
_Records_Matched=17
_Percent_Match=60.71%
-------------------------
NOTE: There were 28 observations read from the data set INTL.Sheet1.
NOTE: The data set WORK.INTL has 28 observations and 26 variables.
SAS Code:
**------------------------------------------------------------------------------**;
LIBNAME Intl XLSX "&Path\SampleData\intl.xlsx";
LIBNAME Cities XLSX "&Path\SampleData\wrld_cities.xlsx";
**------------------------------------------------------------------------------**;
DATA WORK.Cities(RENAME=(Owner_Cntry_Abbrev=_Comp_Cntry));
SET Cities.Sheet1;
_comp_city = COMPRESS(owner_city,,'KA');
RUN;
**------------------------------------------------------------------------------**;
DATA WORK.Intl;
DROP _:;
IF _End_Of_Data THEN
DO;
_Percent_Match = (_Records_Matched / _Records_Read);
PUTLOG "NOTE- ";
PUTLOG "NOTE- -------------------------";
PUTLOG "NOTE: " _Records_Read= COMMA17.;
PUTLOG "NOTE- " _Records_Matched= COMMA17.;
PUTLOG "NOTE- " _Percent_Match= PERCENT8.2;
PUTLOG "NOTE- -------------------------";
PUTLOG "NOTE- ";
END;
SET Intl.Sheet1 (DROP=I _I)
END=_End_Of_Data;
ARRAY Search_Strings [*] $24 owner_city words_parsed1 -- words_parsedv23;
LENGTH Confirmed_City $24;
LENGTH _Comp_City $24;
LENGTH _Comp_Cntry $2;
IF _N_ = 1 THEN
DO;
CALL MISSING (_comp_city, _Comp_Cntry);
DECLARE HASH Hsh_Cities (DATASET: 'WORK.Cities');
Hsh_Cities.DefineKey ('_comp_city');
Hsh_Cities.DefineData ('_Comp_Cntry');
Hsh_Cities.DefineDone ();
END;
City_Confirmed = 0;
Confirmed_City = ' ';
_Records_Read + 1;
DO _i = 1 TO DIM(Search_Strings);
_Comp_City = COMPRESS(Search_Strings[_i],,'KA');
_RC = Hsh_Cities.FIND();
IF _RC = 0 AND
_Comp_Cntry = Owner_Cntry_Abbrev THEN
DO;
City_Confirmed = 1;
Confirmed_City = Search_Strings[_i];
_Records_Matched + 1;
_i = DIM(Search_Strings);
END;
END;
RUN;
**------------------------------------------------------------------------------**;
Partial Results (just the last set of columns):
You're very welcome.
I just got back from a hike, and I was reviewing the data vs. the results. The results are actually pretty darned good. (of course this is a small dataset and the real measure of efficacy will be garnered from a larger sample).
Some observations:
So, at least with this small set of data, the array/hash look up method is working surprisingly well -- when we have data. The Wrld_Cities file is missing a lot of data. To improve the match rate, the first thing we might want to do is to get more data put into the Wrld_Cities file.
Jim
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.