BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Buzzy_Bee
Quartz | Level 8

Hello,

I've got a situation where I need to compare whether the addresses for contacts at companies match. Since Joe, Will and Jane are the contacts for company 123456, and all three have the same address, then they should go into the "matching" table so I know only one information pack needs to be put together and sent to them. Due to unwanted spaces and different spellings of Street, I've corrected these problems and created a variable named Cats_Addr. However, Jane has misspelt 'Chicago' so my matching code would still say their addresses don't match and that multiple information packs should be sent, which is a waste.

I need to use COMPGED to calculate the generalised edit distance between the addresses in this list, as that would tell me Jane's address is only one character different to Joe's and Will's, so I know it is most likely the same address. Likewise, Bob and Ping should get the same information pack, except that London is misspelled, and for the last company, Berlin is misspelled.

Everything I read up on COMPGED always compares two columns next to each other, but I can't find anything where COMPGED is used to compare the address or addresses below it (checking vertically instead of horizontally, checking against every address showing for the same companyID). Please note that there are thousands of customers so I can't just transpose to check the columns that way. Any thoughts please? Thanks for your ideas.

data have;
	infile datalines dlm=',' dsd truncover;
	input CompanyID $ CustID :$20. Name :$12. Address :$15. Address2 :$10.;
	datalines;
123456,336,Joe,1 Any  Street, Chicago
123456,337,Will,1 Any St, Chicago
123456,338,Jane,1   Any St, Chicgo
99888,222,Bob,2 Other Street, London
99888,222,Ping,2 Other St, Londun
77777,123,George,1 May Street, Berlin
77777,124,Rahul,1 May St, Berlen
;
data have2;
	set have;
	length Cats_Addr $25.;
	Cats_Addr=compress(cats(Address,Address2));
	Cats_Addr=tranwrd(Cats_Addr,'Street','St');
run;	
data matching no_match;
	do until (last.companyID);
	set have2;
	by CompanyID notsorted;
		if first.CompanyID then _test=Cats_Addr;
		if Cats_Addr NE _test then _differentAddr=1;
	end;
	do until (last.companyID);
	set have2;
	by CompanyID notsorted;
		if _differentAddr then output no_match;
		else output matching;
	end;
drop _:;	
run;
1 ACCEPTED SOLUTION

Accepted Solutions
Shmuel
Garnet | Level 18

Just few points to think about:

  • should numbers (home, entry, apartment) have same weight as alpha characters?
  • on creating the Cats_Addr - consider adding case test (lowcase/uppercase functions)
  • sending different letters according to CompanyID needs adaptation to the WHERE statement
    in the last step.
  • you may be asked by different CustID to get separate letters, not packed together.
  • you may compare the COMPGED to COMPLEV function by next code:
    proc sql;
       create table spelltest as
       select distinct a.companyID as companyID,
                       a.CustID as CustID1,
                       b.CustID as CustID2,
                       a.Name as Name1,
                       b.Name as Name2,
                       a.Cats_Addr as Cats_Addr1, 
                       b.Cats_Addr as Cats_Addr2
         ,compged(a.Cats_Addr,b.Cats_Addr) as compged_v
         ,complev(a.Cats_Addr,b.Cats_Addr) as complev_v
       /*  ,' ' as useA,' ' as useB */
       from no_match as a, no_match as b
       where a.companyID = b.companyID  and
             a.Cats_Addr lt b.Cats_Addr and 
             compged(a.Cats_Addr,b.Cats_Addr) < 600
       ;
    quit;

View solution in original post

11 REPLIES 11
LeonidBatkhan
Lapis Lazuli | Level 10

Hi Buzzy_Bee,

You can use LAG() function to retrieve a previous observation's value into current data step iteration, then you can apply COMPGED function to compare the 2 values. 

 

Or you can use coding technique described in my blog post Hopping for the best - calculations across SAS dataset observations .

 

Hope this helps.

Buzzy_Bee
Quartz | Level 8

Thanks for the idea, Leonid. I saw the Lag function used once in one of the SAS Community Challenges they used to post on this site last year. I'll read up on it.

Shmuel
Garnet | Level 18

Try next code. You may need calibrate the compged result value and adapt it (600 in this code)

proc sql;
   create table spelltest as
   select distinct a.Cats_Addr as Cats_AddrA, 
                   b.Cats_Addr as Cats_AddrB
     ,compged(a.Cats_Addr,b.Cats_Addr) as spelldif
     ,' ' as useA,' ' as useb
   from no_match as a, no_match as b
   where a.Cats_Addr lt b.Cats_Addr and 
        compged(a.Cats_Addr,b.Cats_Addr) < 600
   ;
quit;  
Buzzy_Bee
Quartz | Level 8

That's a really good idea, Shmuel. I've modified your code slightly to bring back the CompanyID as well as the address for comparison, as I need to see both fields. I've just pasted my modification below for other people who read this post in the future in case they are trying to solve something like this.

 

proc sql;
   create table spelltest as
   select distinct a.companyID as companyID, 
                   a.Cats_Addr as Cats_AddrA, 
                   b.Cats_Addr as Cats_AddrB
     ,compged(a.Cats_Addr,b.Cats_Addr) as spelldif
     ,' ' as useA,' ' as useb
   from no_match as a, no_match as b
   where a.Cats_Addr lt b.Cats_Addr and 
        compged(a.Cats_Addr,b.Cats_Addr) < 600
   ;
quit;
SASKiwi
PROC Star

@Buzzy_Bee - Not sure if you are aware of this but SAS Data Quality (aka DataFlux) contains address matching capabilities that takes the sorts of problems you have and scales them up to provide an enterprise-wide data matching and standardization solutions. It also contains customizations for the address peculiarities of different countries.

 

Unlikely to be relevant to your requirements, but I thought it would be useful to point out for others.

Buzzy_Bee
Quartz | Level 8

Thanks SAS Kiwi, but my workplace has only given us access to SAS EG. Isn't SAS Data Quality an extra package that costs more? I think in the future, the company will look at offering us SAS Viya as an option, but I don't know about other packages or add ons. Currently they only supply SAS EG and then I use SAS Studio at home when I'm practicising my SAS skills.

SASKiwi
PROC Star

@Buzzy_Bee - Yes, SAS Data Quality is not included in Base SAS and is a separately-licensed product. It is really for organisations wanting enterprise-wide data quality solutions.

Shmuel
Garnet | Level 18

Just few points to think about:

  • should numbers (home, entry, apartment) have same weight as alpha characters?
  • on creating the Cats_Addr - consider adding case test (lowcase/uppercase functions)
  • sending different letters according to CompanyID needs adaptation to the WHERE statement
    in the last step.
  • you may be asked by different CustID to get separate letters, not packed together.
  • you may compare the COMPGED to COMPLEV function by next code:
    proc sql;
       create table spelltest as
       select distinct a.companyID as companyID,
                       a.CustID as CustID1,
                       b.CustID as CustID2,
                       a.Name as Name1,
                       b.Name as Name2,
                       a.Cats_Addr as Cats_Addr1, 
                       b.Cats_Addr as Cats_Addr2
         ,compged(a.Cats_Addr,b.Cats_Addr) as compged_v
         ,complev(a.Cats_Addr,b.Cats_Addr) as complev_v
       /*  ,' ' as useA,' ' as useB */
       from no_match as a, no_match as b
       where a.companyID = b.companyID  and
             a.Cats_Addr lt b.Cats_Addr and 
             compged(a.Cats_Addr,b.Cats_Addr) < 600
       ;
    quit;
Buzzy_Bee
Quartz | Level 8

Thanks for the COMPLEV suggestion. The COMPLEV scoring system is a lot easier to read than for COMPGED so I'll definitely use that one.

Thankfully the database stores all addresses in uppercase after customers enter their addresses on the webpage, so I don't have to worry about case insensitivity. 

For the prefixes (unit, flat, house, apartment etc) I'm planning to swap them out for blank spaces so that "Apartment 1b 10 Any Street" will just become "1b 10 Any Street." That should improve the difference scoring.

 

One headache is when customers have different opinions about the suburb they're in. For example, one writes "10 Any Street Putney London" and the other writes "10 Any Street Wimbledon London" (they're neighbouring suburbs). If I use the substring function substr(), I can take the first 12 or so of my compressed address field and the last six or so on the right in order to create a string that can be compared. That's the best solution I can think of. But I welcome any other ideas 🙂

SASKiwi
PROC Star

@Buzzy_Bee  - Regarding the adjacent suburb issue. One approach would be to split the addresses into component parts of street address, suburb, city and hopefully also have postcode. Now compare a street address, city and postcode string  - its unlikely there are duplicate street names within a postcode.

Buzzy_Bee
Quartz | Level 8

Would you separate each using a space as the delimiter? The trouble with my addresses is that people are free to write what they want into the fields, so it looks something like this:

Name: Joe Bloggs

Address line 1: Joe Bloggs Investments

Address line 2: Level 5

Address line 3: 999 New York Street

Address line 4: Wimbledon

Address line 5: London UK 2163

 

So I don't have nice tidy fields named street, suburb, city, country, postcode. So one customer might write 999 New York Street in the same field that the other person writes Wimbledon. There is no consistent suburb field I can remove, unfortunately. 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1555 views
  • 4 likes
  • 4 in conversation