Solved: Re: Using COMPGED to check for vertical matches in a column

Buzzy_Bee · Posted 08-22-2020 08:30 PM

Hello,

I've got a situation where I need to compare whether the addresses for contacts at companies match. Since Joe, Will and Jane are the contacts for company 123456, and all three have the same address, then they should go into the "matching" table so I know only one information pack needs to be put together and sent to them. Due to unwanted spaces and different spellings of Street, I've corrected these problems and created a variable named Cats_Addr. However, Jane has misspelt 'Chicago' so my matching code would still say their addresses don't match and that multiple information packs should be sent, which is a waste.

I need to use COMPGED to calculate the generalised edit distance between the addresses in this list, as that would tell me Jane's address is only one character different to Joe's and Will's, so I know it is most likely the same address. Likewise, Bob and Ping should get the same information pack, except that London is misspelled, and for the last company, Berlin is misspelled.

Everything I read up on COMPGED always compares two columns next to each other, but I can't find anything where COMPGED is used to compare the address or addresses below it (checking vertically instead of horizontally, checking against every address showing for the same companyID). Please note that there are thousands of customers so I can't just transpose to check the columns that way. Any thoughts please? Thanks for your ideas.

data have;
	infile datalines dlm=',' dsd truncover;
	input CompanyID $ CustID :$20. Name :$12. Address :$15. Address2 :$10.;
	datalines;
123456,336,Joe,1 Any  Street, Chicago
123456,337,Will,1 Any St, Chicago
123456,338,Jane,1   Any St, Chicgo
99888,222,Bob,2 Other Street, London
99888,222,Ping,2 Other St, Londun
77777,123,George,1 May Street, Berlin
77777,124,Rahul,1 May St, Berlen
;
data have2;
	set have;
	length Cats_Addr $25.;
	Cats_Addr=compress(cats(Address,Address2));
	Cats_Addr=tranwrd(Cats_Addr,'Street','St');
run;	
data matching no_match;
	do until (last.companyID);
	set have2;
	by CompanyID notsorted;
		if first.CompanyID then _test=Cats_Addr;
		if Cats_Addr NE _test then _differentAddr=1;
	end;
	do until (last.companyID);
	set have2;
	by CompanyID notsorted;
		if _differentAddr then output no_match;
		else output matching;
	end;
drop _:;	
run;

Shmuel · Posted 08-23-2020 01:48 AM

Just few points to think about:

should numbers (home, entry, apartment) have same weight as alpha characters?
on creating the Cats_Addr - consider adding case test (lowcase/uppercase functions)
sending different letters according to CompanyID needs adaptation to the WHERE statement
in the last step.
you may be asked by different CustID to get separate letters, not packed together.

you may compare the COMPGED to COMPLEV function by next code:

proc sql;
   create table spelltest as
   select distinct a.companyID as companyID,
                   a.CustID as CustID1,
                   b.CustID as CustID2,
                   a.Name as Name1,
                   b.Name as Name2,
                   a.Cats_Addr as Cats_Addr1, 
                   b.Cats_Addr as Cats_Addr2
     ,compged(a.Cats_Addr,b.Cats_Addr) as compged_v
     ,complev(a.Cats_Addr,b.Cats_Addr) as complev_v
   /*  ,' ' as useA,' ' as useB */
   from no_match as a, no_match as b
   where a.companyID = b.companyID  and
         a.Cats_Addr lt b.Cats_Addr and 
         compged(a.Cats_Addr,b.Cats_Addr) < 600
   ;
quit;

View solution in original post

LeonidBatkhan · Posted 08-22-2020 09:31 PM

Hi Buzzy_Bee,

You can use LAG() function to retrieve a previous observation's value into current data step iteration, then you can apply COMPGED function to compare the 2 values.

Or you can use coding technique described in my blog post Hopping for the best - calculations across SAS dataset observations .

Hope this helps.

➤ Leonid's SAS blog

Buzzy_Bee · Posted 08-22-2020 09:51 PM

Thanks for the idea, Leonid. I saw the Lag function used once in one of the SAS Community Challenges they used to post on this site last year. I'll read up on it.

Shmuel · Posted 08-22-2020 09:33 PM

Try next code. You may need calibrate the compged result value and adapt it (600 in this code)

proc sql;
   create table spelltest as
   select distinct a.Cats_Addr as Cats_AddrA, 
                   b.Cats_Addr as Cats_AddrB
     ,compged(a.Cats_Addr,b.Cats_Addr) as spelldif
     ,' ' as useA,' ' as useb
   from no_match as a, no_match as b
   where a.Cats_Addr lt b.Cats_Addr and 
        compged(a.Cats_Addr,b.Cats_Addr) < 600
   ;
quit;

Buzzy_Bee · Posted 08-22-2020 09:42 PM

That's a really good idea, Shmuel. I've modified your code slightly to bring back the CompanyID as well as the address for comparison, as I need to see both fields. I've just pasted my modification below for other people who read this post in the future in case they are trying to solve something like this.

proc sql;
   create table spelltest as
   select distinct a.companyID as companyID, 
                   a.Cats_Addr as Cats_AddrA, 
                   b.Cats_Addr as Cats_AddrB
     ,compged(a.Cats_Addr,b.Cats_Addr) as spelldif
     ,' ' as useA,' ' as useb
   from no_match as a, no_match as b
   where a.Cats_Addr lt b.Cats_Addr and 
        compged(a.Cats_Addr,b.Cats_Addr) < 600
   ;
quit;

SASKiwi · Posted 08-22-2020 11:55 PM

@Buzzy_Bee - Not sure if you are aware of this but SAS Data Quality (aka DataFlux) contains address matching capabilities that takes the sorts of problems you have and scales them up to provide an enterprise-wide data matching and standardization solutions. It also contains customizations for the address peculiarities of different countries.

Unlikely to be relevant to your requirements, but I thought it would be useful to point out for others.

Buzzy_Bee · Posted 08-23-2020 12:08 AM

Thanks SAS Kiwi, but my workplace has only given us access to SAS EG. Isn't SAS Data Quality an extra package that costs more? I think in the future, the company will look at offering us SAS Viya as an option, but I don't know about other packages or add ons. Currently they only supply SAS EG and then I use SAS Studio at home when I'm practicising my SAS skills.

SASKiwi · Posted 08-23-2020 12:14 AM

@Buzzy_Bee - Yes, SAS Data Quality is not included in Base SAS and is a separately-licensed product. It is really for organisations wanting enterprise-wide data quality solutions.

Shmuel · Posted 08-23-2020 01:48 AM

Just few points to think about:

should numbers (home, entry, apartment) have same weight as alpha characters?
on creating the Cats_Addr - consider adding case test (lowcase/uppercase functions)
sending different letters according to CompanyID needs adaptation to the WHERE statement
in the last step.
you may be asked by different CustID to get separate letters, not packed together.

you may compare the COMPGED to COMPLEV function by next code:

proc sql;
   create table spelltest as
   select distinct a.companyID as companyID,
                   a.CustID as CustID1,
                   b.CustID as CustID2,
                   a.Name as Name1,
                   b.Name as Name2,
                   a.Cats_Addr as Cats_Addr1, 
                   b.Cats_Addr as Cats_Addr2
     ,compged(a.Cats_Addr,b.Cats_Addr) as compged_v
     ,complev(a.Cats_Addr,b.Cats_Addr) as complev_v
   /*  ,' ' as useA,' ' as useB */
   from no_match as a, no_match as b
   where a.companyID = b.companyID  and
         a.Cats_Addr lt b.Cats_Addr and 
         compged(a.Cats_Addr,b.Cats_Addr) < 600
   ;
quit;

Buzzy_Bee · Posted 08-23-2020 03:10 AM

Thanks for the COMPLEV suggestion. The COMPLEV scoring system is a lot easier to read than for COMPGED so I'll definitely use that one.

Thankfully the database stores all addresses in uppercase after customers enter their addresses on the webpage, so I don't have to worry about case insensitivity.

For the prefixes (unit, flat, house, apartment etc) I'm planning to swap them out for blank spaces so that "Apartment 1b 10 Any Street" will just become "1b 10 Any Street." That should improve the difference scoring.

One headache is when customers have different opinions about the suburb they're in. For example, one writes "10 Any Street Putney London" and the other writes "10 Any Street Wimbledon London" (they're neighbouring suburbs). If I use the substring function substr(), I can take the first 12 or so of my compressed address field and the last six or so on the right in order to create a string that can be compared. That's the best solution I can think of. But I welcome any other ideas 🙂

SASKiwi · Posted 08-24-2020 03:18 AM

@Buzzy_Bee - Regarding the adjacent suburb issue. One approach would be to split the addresses into component parts of street address, suburb, city and hopefully also have postcode. Now compare a street address, city and postcode string - its unlikely there are duplicate street names within a postcode.

Buzzy_Bee · Posted 08-24-2020 03:31 AM

Would you separate each using a space as the delimiter? The trouble with my addresses is that people are free to write what they want into the fields, so it looks something like this:

Name: Joe Bloggs

Address line 1: Joe Bloggs Investments

Address line 2: Level 5

Address line 3: 999 New York Street

Address line 4: Wimbledon

Address line 5: London UK 2163

So I don't have nice tidy fields named street, suburb, city, country, postcode. So one customer might write 999 New York Street in the same field that the other person writes Wimbledon. There is no consistent suburb field I can remove, unfortunately.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away