10-31-2016 12:49 PM
The question is about identifying the similar observations in SAS.
In the main dataset that I have there are dealers with multiple ids and similar addresses. Because the addresses were entered manually, in many cases the addresses don't match. Here is an example below:
Is there a way in SAS to identify these entries and create a new entry (New_Dealer_ID) ?
I cannot hard-code it because the dataset is bigger and there are variations in how the addresses were entered.
I also found a couple of Fuzzy-matching documentation on the internet but they look way more complicated compared to what I am trying to do.
Thank you for your help
10-31-2016 01:25 PM
How many records are involved? If the number is massive you may need some pretty complex tools, if there are a few hundred then not so many.
There is an application from CDC called Link-Plus that does some name/address/other characteristic matching and assingns a "match likelihood" type score.
10-31-2016 05:11 PM
So you have a sort-of-large (defined as too bing to spend a lot of manual time on but not in the million plus range) problem.
I would take a look at how many exact matches you can find. If the number found is is much less than 9K then other tools.
I have used Link plus as there is no charge for use, it works pretty well and has options that let you set thresholds for "assumed match" which may be more important for larger problem sets. You could examine it here: http://www.cdc.gov/cancer/npcr/tools/registryplus/lp_tech_info.htm
10-31-2016 06:45 PM
Do you have a way to limit comparisons, for example a field that is reliable, city/town? Otherwise that's a heck of a lot of comparisons that need to occur.
11-01-2016 01:23 AM
Fuzz matching is really uneasy. data x; input Dealer_id Address_1 $20.; cards; 1 Rome 1020 2 Rome 1020P 3 Paris-SuiteB 4 Paris SuiteB ; run; proc sql; create table have as select a.address_1 as from,b.address_1 as to from x as a,x as b where a.address_1 ne b.address_1 group by a.address_1 having spedis(a.address_1,b.address_1)= min(spedis(a.address_1,b.address_1)); quit; data full; set have end=last; if _n_ eq 1 then do; declare hash h(); h.definekey('node'); h.definedata('node'); h.definedone(); end; output; node=from; h.replace(); from=to; to=node; output; node=from; h.replace(); if last then h.output(dataset:'node'); drop node; run; data want(keep=node household); declare hash ha(ordered:'a'); declare hiter hi('ha'); ha.definekey('count'); ha.definedata('last'); ha.definedone(); declare hash _ha(hashexp: 20); _ha.definekey('key'); _ha.definedone(); if 0 then set full; declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y'); from_to.definekey('from'); from_to.definedata('to'); from_to.definedone(); if 0 then set node; declare hash no(dataset:'node'); declare hiter hi_no('no'); no.definekey('node'); no.definedata('node'); no.definedone(); do while(hi_no.next()=0); household+1; output; count=1; key=node;_ha.add(); last=node;ha.add(); rc=hi.first(); do while(rc=0); from=last;rx=from_to.find(); do while(rx=0); key=to;ry=_ha.check(); if ry ne 0 then do; node=to;output;rr=no.remove(key:node); key=to;_ha.add(); count+1; last=to;ha.add(); end; rx=from_to.find_next(); end; rc=hi.next(); end; ha.clear();_ha.clear(); end; stop; run;
11-01-2016 10:13 AM
@Ksharp, wow this works perfectly.
I'll try this on the dataset that I have. Since I am not familiar with creating hash objects in SAS, I'll have to figure out the logic behind the code.
Thank you very much!