SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

De-duplicating Addresses

Reply
Occasional Contributor
Posts: 5

De-duplicating Addresses

[ Edited ]

I have a large dataset with addresses where each individual has multiple addresses. I am using a ArcGIS to geocode these addresses but I am first trying to deduplicate the addresses in SAS. Sometime the addresses are actually different addresses and sometime they are the same address repeated, with time indicator for each address. In the example observations below, I would like all three rows to collapse to one row. I was wondering if anyone on here has ever done this (and save me some time and headaches). At the bottom you can see the code I have started for this. I made the dataset horizontal, so that each individual has one row and mulitple addresses (ending in _1 - _17). I was trying to match with the COMPGED function. If a person moved from address A then to B and back to A I would not want the addresses to collapse, as the time variables will show that the individual was at address A from two separate time periods and address B during a time period. Once I have sufficiently deduplicated the addresses, I will make the dataset long again and the export to geocode.

 

Raw data:

Row   ID     Address                       City               State      Zipcode      DateBegin      DateEnd

1         1      123 Northbay Lane     Sometown      CA         12345         1/1/2015         2/28/2015

2         1      123 Northbay Ln         Sometown      CA         12345         3/1/2015         11/30/2016

3         1      123 Nrthby Lane         Sometown      CA         12345         12/1/2016       3/10/2017

 

Change to:

Row   ID     Address                       City               State      Zipcode      DateBegin      DateEnd

1         1      123 Northbay Lane     Sometown      CA         12345         1/1/2015         3/10/2017

 

In this example I just show one example for one person but this could happen several times for one person then the data sets are several hundred thousand participants so I can't manually make the changes. Let me know if I can explain anything.

Any advice would be greatly appreciated!

 - Claire

 

Code I have started:

        f_add_1         = todedup_add_1;

do i = 2 to 17;
do j = 2 to 16;
    if todedup_add_(i) ne ' ' then do;            
            if COMPGED(todedup_add_(i-1),todedup_add_(i), ':') = 0 then continue; /* MATCH */
            else if (0 < COMPGED(todedup_add_(i-1),todedup_add_(i), ':') <= 500) and (0 = COMPGED(house_num_(i-1),house_num_(i), ':')) then continue; /* MATCH */
            else if (500 < COMPGED(todedup_add_(i-1),todedup_add_(i), ':') <= 600) and (0 = COMPGED(house_num_(i-1),house_num_(i), ':')) and (0 = COMPGED(zip_(i-1),zip_(i), ':')) then continue; /* MATCH, if the house number is identical and the zipcode then it is a match */
            else if 600 < COMPGED(todedup_add_(i-1),todedup_add_(i), ':') then do; /* NOT A MATCH; MAKE A NEW FINAL ADDRESS */
                f_add_(j) = todedup_add_(i);    
                if i = incoming_row_count then leave;
                i=i+1;
            end;
        end;
    end;
end;

Super User
Posts: 10,483

Re: De-duplicating Addresses

How large is "large"?

 

If this is 1000's of records or larger I would suggest finding a matching program you like that could look at addresses.

I have used one that is available free from CDC called LinkPlus. You can find download and documentation here: https://www.cdc.gov/cancer/npcr/tools/registryplus/lp_tech_info.htm

 

There may well be more options than you need but I think it may do what you want as far as Identifying the "duplicate" addresses.

One of the features let you set a match threshold on a probability of match parameter. So play with that to find the ones that don't match and reduce some of the work load.

Likely you would take the output from that after some massaging do something similar to:

proc summary data=have nway;
  class id address city state zipcode;
  var datebegin dateend;
  output out=collapsed(drop= _:) min(datebegin)= Max(dateend=);
run;

To combine the date periods.

 

Occasional Contributor
Posts: 5

Re: De-duplicating Addresses

[ Edited ]

Hi @ballardw

 

Large is more than 200,000 people with an average of around 5-6 raw addresses.  Thank you for the link for the LinkPlus, I will look it over today and see if it will work. I think we will need a free software so hopefully this will fit our needs.

 

I have noticed that a porportion of the people move back and forth with their addresses (see below).  I think the example code that you provided would collapse any of the same address for each person? From the exampel below would your code make Pineapple Lane from 1/1/2015 - 5/1/2017 and Orange Street 12/1/2016 - 3/10/2017? In the end I wanted to keep if they were moving back and forth (so showing that they went from Pineapple Lane to Orange Street back to Pineapple Lane).

 

Raw data:

Row   ID     Address                         City               State       Zipcode       DateBegin      DateEnd

1         1      123 Pineapple Lane     Sometown       CA          12345         1/1/2015         2/28/2015

2         1      123 Pineapple Lane     Sometown       CA          12345         3/1/2015         11/30/2016

3         1      456 Orange Street       Othertown       CA          54321         12/1/2016       3/10/2017

4         1      123 Pineapple Lane     Sometown       CA          12345         3/10/2017       5/1/2017

 

Change to:

Row   ID     Address                       City               State        Zipcode          DateBegin      DateEnd

1         1      123 Pineapple Lane     Sometown       CA          12345         1/1/2015         11/30/2016

2         1      456 Orange Street       Othertown       CA          54321         12/1/2016       3/10/2017

3         1      123 Pineapple Lane     Sometown       CA          12345         3/10/2017       5/1/2017

Super User
Posts: 3,102

Re: De-duplicating Addresses

In addition to @ballardw's suggestions SAS's Dataflux product is designed to do exactly what you doing plus a lot more. If this is a one-off exercise then a free tool would be preferable but if your company is interested it an enterprise DQ and matching tool Dataflux is definitely worth checking out.

Occasional Contributor
Posts: 5

Re: De-duplicating Addresses

Thank you @SASKiwifor letting me know about Dataflux! I will let my supervisor know about this and the free CDC software. I am not sure if we are allowed to buy a product to fix this task as it if (hopefully) be a one time data cleaning.

Ask a Question
Discussion stats
  • 4 replies
  • 132 views
  • 1 like
  • 3 in conversation