How to identify similar observations which were entered differently in SAS?

Reply
Contributor krm
Contributor
Posts: 26

How to identify similar observations which were entered differently in SAS?

The question is about identifying the similar observations in SAS. 

 

In the main dataset that I have there are dealers with multiple ids and similar addresses. Because the addresses were entered manually, in many cases the addresses don't match. Here is an example below:

 

Is there a way in SAS to identify these entries and create a new entry (New_Dealer_ID) ?

 

I cannot hard-code it because the dataset is bigger and there are variations in how the addresses were entered. 

 

 

New_Dealer_IDDealer_idAddress_1                       
A11Rome 1020
A12Rome 1020P
A23Paris-SuiteB
A24Paris SuiteB

 

 

I also found a couple of Fuzzy-matching documentation on the internet but they look way more complicated compared to what I am trying to do. 

 

Thank you for your help

Super User
Posts: 11,101

Re: How to identify similar observations which were entered differently in SAS?

How many records are involved? If the number is massive you may need some pretty complex tools, if there are a few hundred then not so many.

 

There is an application from CDC called Link-Plus that does some name/address/other characteristic matching and assingns a "match likelihood" type score.

Contributor krm
Contributor
Posts: 26

Re: How to identify similar observations which were entered differently in SAS?

@ballardw Thank you for the reply. There are more than 13K entries. 

Super User
Posts: 11,101

Re: How to identify similar observations which were entered differently in SAS?

So you have a sort-of-large (defined as too bing to spend a lot of manual time on but not in the million plus range) problem.

I would take a look at how many exact matches you can find. If the number found is is much less than 9K then other tools.

 

I have used Link plus as there is no charge for use, it works pretty well and has options that let you set thresholds for "assumed match" which may be more important for larger problem sets. You could examine it here: http://www.cdc.gov/cancer/npcr/tools/registryplus/lp_tech_info.htm

 

 

Contributor krm
Contributor
Posts: 26

Re: How to identify similar observations which were entered differently in SAS?

Thank you @ballardw. I will take a look at it. 

Super User
Posts: 18,997

Re: How to identify similar observations which were entered differently in SAS?

Do you have a way to limit comparisons, for example a field that is reliable, city/town? Otherwise that's a heck of a lot of comparisons that need to occur.

Super User
Posts: 9,856

Re: How to identify similar observations which were entered differently in SAS?

Fuzz matching is really uneasy.





data x;
input Dealer_id	Address_1  $20.;
cards;                    
1 Rome 1020
2 Rome 1020P
3 Paris-SuiteB
4 Paris SuiteB
;
run;

proc sql;
create table have as 
 select a.address_1 as from,b.address_1 as to
  from x as a,x as b 
   where a.address_1 ne b.address_1
    group by a.address_1
      having spedis(a.address_1,b.address_1)=
          min(spedis(a.address_1,b.address_1));
quit;

data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;




Super User
Posts: 9,856

Re: How to identify similar observations which were entered differently in SAS?

[ Edited ]

deleted .

Super User
Posts: 9,856

Re: How to identify similar observations which were entered differently in SAS?

Use my first post. I am not able to delete the second post.
Contributor krm
Contributor
Posts: 26

Re: How to identify similar observations which were entered differently in SAS?

@Ksharp, wow this works perfectly. 

 

I'll try this on the dataset that I have. Since I am not familiar with creating hash objects in SAS, I'll have to figure out the logic behind the code. 

 

Thank you very much!

 

Ask a Question
Discussion stats
  • 9 replies
  • 387 views
  • 1 like
  • 4 in conversation