Solved: Retain a value from previous obsevation by comparing first 10 characte...

Stalk · Posted 11-24-2020 07:53 PM

All,

I have a dataset with thousands of rows and about 45 columns and I need to clean up the duplicates.I was able to identify dedups but the challenging part is ADDR1 field. I want to compare the first few characters of the ADDR1 field from the previous sub_date to recent sub_date, and if they are closely matching(at least 90%) then I want to retain the previous ADDR1 value to the most recent record. When the ADDR1 field match then I would like to delete the old sub_date record. Is this possible?

data to_clean ;

length sub_date $10. Unit_Name $100 ADDR1 $100;

infile cards dsd dlm='|' truncover ;

input sub_date -- ADDR1 ;

cards;

11/21/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN

10/30/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN STREET

10/30/2020|BIG HORN ENTERPRISE|146 S. BENT STREET

10/30/2020|BIG HORN ENTERPRISE|641 WARREN STREET

11/5/2020|BROOKDALE |

10/29/2020|BROOKDALE|2401 COUGAR AVENUE

10/30/2020|ELMCROFT|1551 SUGARLAND DRIVE

11/2/2020|ELMCROFT|1551 SUGARLAND DRIVE DRIVE

11/21/2020|GREEN HOUSE LIVING|2311 SHIRLEY

10/29/2020|GREEN HOUSE LIVING|2311 SHIRLEY COVE

11/21/2020|MISSION AT THE VILLA|1445 UINTA

11/2/2020|MISSION AT THE VILLA|1445 UINTA DRIVE

;

run;

Thank you

PGStats · Posted 11-25-2020 12:17 AM

You could use a spelling distance function such as COMPLEV:

data to_clean;
infile cards dlm='|' truncover ;
input subDate :mmddyy10. unitName :$100. ADDR1 :$100. ;
format subDate yymmdd10.;
cards;
11/21/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN
10/30/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN STREET
10/30/2020|BIG HORN ENTERPRISE|146 S. BENT STREET
10/30/2020|BIG HORN ENTERPRISE|641 WARREN STREET
11/5/2020|BROOKDALE |
10/29/2020|BROOKDALE|2401 COUGAR AVENUE
10/30/2020|ELMCROFT|1551 SUGARLAND DRIVE
11/2/2020|ELMCROFT|1551 SUGARLAND DRIVE DRIVE
11/21/2020|GREEN HOUSE LIVING|2311 SHIRLEY
10/29/2020|GREEN HOUSE LIVING|2311 SHIRLEY COVE
11/21/2020|MISSION AT THE VILLA|1445 UINTA
11/2/2020|MISSION AT THE VILLA|1445 UINTA DRIVE
;

proc sort data=to_clean; by unitName subDate; run;

data to_clean_2;
set to_clean;
by unitName;
length goodAddr $100;
retain goodAddr;
if first.unitName then goodAddr = addr1;
else if complev(trim(goodAddr), trim(addr1), "IL:") > 2 then goodAddr = addr1;
run;

proc sort data=to_clean_2; by unitName goodAddr subDate; run;

data want;
set to_clean_2;
by unitName goodAddr;
if last.goodAddr;
run;

proc print noobs data=want; run;

PG

View solution in original post

mkeintz · Posted 11-24-2020 08:53 PM

And for your sample data set, what do you expect the desired results to look like?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Stalk · Posted 11-24-2020 10:24 PM

Flag the previous record (old sub_date) as duplicate to delete it later.

Then my final data will be clean with one record per unit_name/addr1

PGStats · Posted 11-25-2020 12:17 AM

You could use a spelling distance function such as COMPLEV:

data to_clean;
infile cards dlm='|' truncover ;
input subDate :mmddyy10. unitName :$100. ADDR1 :$100. ;
format subDate yymmdd10.;
cards;
11/21/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN
10/30/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN STREET
10/30/2020|BIG HORN ENTERPRISE|146 S. BENT STREET
10/30/2020|BIG HORN ENTERPRISE|641 WARREN STREET
11/5/2020|BROOKDALE |
10/29/2020|BROOKDALE|2401 COUGAR AVENUE
10/30/2020|ELMCROFT|1551 SUGARLAND DRIVE
11/2/2020|ELMCROFT|1551 SUGARLAND DRIVE DRIVE
11/21/2020|GREEN HOUSE LIVING|2311 SHIRLEY
10/29/2020|GREEN HOUSE LIVING|2311 SHIRLEY COVE
11/21/2020|MISSION AT THE VILLA|1445 UINTA
11/2/2020|MISSION AT THE VILLA|1445 UINTA DRIVE
;

proc sort data=to_clean; by unitName subDate; run;

data to_clean_2;
set to_clean;
by unitName;
length goodAddr $100;
retain goodAddr;
if first.unitName then goodAddr = addr1;
else if complev(trim(goodAddr), trim(addr1), "IL:") > 2 then goodAddr = addr1;
run;

proc sort data=to_clean_2; by unitName goodAddr subDate; run;

data want;
set to_clean_2;
by unitName goodAddr;
if last.goodAddr;
run;

proc print noobs data=want; run;

PG

Stalk · Posted 11-25-2020 12:34 PM

Thank you so much. First time hearing about complev function. Very helpful and the code works perfectly as desired.

Retain a value from previous obsevation by comparing first 10 characters of address field...

Re: Retain a value from previous obsevation by comparing first 10 characters of address field...

Re: Retain a value from previous obsevation by comparing first 10 characters of address field...

Re: Retain a value from previous obsevation by comparing first 10 characters of address field...

Re: Retain a value from previous obsevation by comparing first 10 characters of address field...

Re: Retain a value from previous obsevation by comparing first 10 characters of address field...

Registration is open

Retain a value from previous obsevation by comparing first 10 characters of address field...

Re: Retain a value from previous obsevation by comparing first 10 characters of address field...

Re: Retain a value from previous obsevation by comparing first 10 characters of address field...

Re: Retain a value from previous obsevation by comparing first 10 characters of address field...

Re: Retain a value from previous obsevation by comparing first 10 characters of address field...

Re: Retain a value from previous obsevation by comparing first 10 characters of address field...

Registration is open

SAS Training: Just a Click Away