All,
I have a dataset with thousands of rows and about 45 columns and I need to clean up the duplicates.I was able to identify dedups but the challenging part is ADDR1 field. I want to compare the first few characters of the ADDR1 field from the previous sub_date to recent sub_date, and if they are closely matching(at least 90%) then I want to retain the previous ADDR1 value to the most recent record. When the ADDR1 field match then I would like to delete the old sub_date record. Is this possible?
data to_clean ;
length sub_date $10. Unit_Name $100 ADDR1 $100;
infile cards dsd dlm='|' truncover ;
input sub_date -- ADDR1 ;
cards;
11/21/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN
10/30/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN STREET
10/30/2020|BIG HORN ENTERPRISE|146 S. BENT STREET
10/30/2020|BIG HORN ENTERPRISE|641 WARREN STREET
11/5/2020|BROOKDALE |
10/29/2020|BROOKDALE|2401 COUGAR AVENUE
10/30/2020|ELMCROFT|1551 SUGARLAND DRIVE
11/2/2020|ELMCROFT|1551 SUGARLAND DRIVE DRIVE
11/21/2020|GREEN HOUSE LIVING|2311 SHIRLEY
10/29/2020|GREEN HOUSE LIVING|2311 SHIRLEY COVE
11/21/2020|MISSION AT THE VILLA|1445 UINTA
11/2/2020|MISSION AT THE VILLA|1445 UINTA DRIVE
;
run;
Thank you
You could use a spelling distance function such as COMPLEV:
data to_clean;
infile cards dlm='|' truncover ;
input subDate :mmddyy10. unitName :$100. ADDR1 :$100. ;
format subDate yymmdd10.;
cards;
11/21/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN
10/30/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN STREET
10/30/2020|BIG HORN ENTERPRISE|146 S. BENT STREET
10/30/2020|BIG HORN ENTERPRISE|641 WARREN STREET
11/5/2020|BROOKDALE |
10/29/2020|BROOKDALE|2401 COUGAR AVENUE
10/30/2020|ELMCROFT|1551 SUGARLAND DRIVE
11/2/2020|ELMCROFT|1551 SUGARLAND DRIVE DRIVE
11/21/2020|GREEN HOUSE LIVING|2311 SHIRLEY
10/29/2020|GREEN HOUSE LIVING|2311 SHIRLEY COVE
11/21/2020|MISSION AT THE VILLA|1445 UINTA
11/2/2020|MISSION AT THE VILLA|1445 UINTA DRIVE
;
proc sort data=to_clean; by unitName subDate; run;
data to_clean_2;
set to_clean;
by unitName;
length goodAddr $100;
retain goodAddr;
if first.unitName then goodAddr = addr1;
else if complev(trim(goodAddr), trim(addr1), "IL:") > 2 then goodAddr = addr1;
run;
proc sort data=to_clean_2; by unitName goodAddr subDate; run;
data want;
set to_clean_2;
by unitName goodAddr;
if last.goodAddr;
run;
proc print noobs data=want; run;
And for your sample data set, what do you expect the desired results to look like?
Flag the previous record (old sub_date) as duplicate to delete it later.
Then my final data will be clean with one record per unit_name/addr1
You could use a spelling distance function such as COMPLEV:
data to_clean;
infile cards dlm='|' truncover ;
input subDate :mmddyy10. unitName :$100. ADDR1 :$100. ;
format subDate yymmdd10.;
cards;
11/21/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN
10/30/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN STREET
10/30/2020|BIG HORN ENTERPRISE|146 S. BENT STREET
10/30/2020|BIG HORN ENTERPRISE|641 WARREN STREET
11/5/2020|BROOKDALE |
10/29/2020|BROOKDALE|2401 COUGAR AVENUE
10/30/2020|ELMCROFT|1551 SUGARLAND DRIVE
11/2/2020|ELMCROFT|1551 SUGARLAND DRIVE DRIVE
11/21/2020|GREEN HOUSE LIVING|2311 SHIRLEY
10/29/2020|GREEN HOUSE LIVING|2311 SHIRLEY COVE
11/21/2020|MISSION AT THE VILLA|1445 UINTA
11/2/2020|MISSION AT THE VILLA|1445 UINTA DRIVE
;
proc sort data=to_clean; by unitName subDate; run;
data to_clean_2;
set to_clean;
by unitName;
length goodAddr $100;
retain goodAddr;
if first.unitName then goodAddr = addr1;
else if complev(trim(goodAddr), trim(addr1), "IL:") > 2 then goodAddr = addr1;
run;
proc sort data=to_clean_2; by unitName goodAddr subDate; run;
data want;
set to_clean_2;
by unitName goodAddr;
if last.goodAddr;
run;
proc print noobs data=want; run;
Thank you so much. First time hearing about complev function. Very helpful and the code works perfectly as desired.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.