BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Stalk
Pyrite | Level 9

All,

I have a dataset with thousands of rows and about 45 columns and I need to clean up the duplicates.I was able to identify dedups but the challenging part is ADDR1 field. I want to compare the first few characters of the ADDR1 field from the previous sub_date to recent sub_date, and if they are closely matching(at least 90%) then I want to retain the previous ADDR1 value to the most recent record. When the ADDR1 field match then I would like to delete the old sub_date record. Is this possible?

data to_clean ;

length sub_date $10. Unit_Name $100 ADDR1 $100;

infile cards dsd dlm='|' truncover ;

input sub_date -- ADDR1 ;

cards;

 

11/21/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN

10/30/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN STREET

10/30/2020|BIG HORN ENTERPRISE|146 S. BENT STREET

10/30/2020|BIG HORN ENTERPRISE|641 WARREN STREET

11/5/2020|BROOKDALE |

10/29/2020|BROOKDALE|2401 COUGAR AVENUE

10/30/2020|ELMCROFT|1551 SUGARLAND DRIVE

11/2/2020|ELMCROFT|1551 SUGARLAND DRIVE DRIVE

11/21/2020|GREEN HOUSE LIVING|2311 SHIRLEY

10/29/2020|GREEN HOUSE LIVING|2311 SHIRLEY COVE

11/21/2020|MISSION AT THE VILLA|1445 UINTA

11/2/2020|MISSION AT THE VILLA|1445 UINTA DRIVE

;

run;

 

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

You could use a spelling distance function such as COMPLEV:

 

data to_clean;
infile cards dlm='|' truncover ;
input subDate :mmddyy10. unitName :$100. ADDR1 :$100. ;
format subDate yymmdd10.;
cards;
11/21/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN
10/30/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN STREET
10/30/2020|BIG HORN ENTERPRISE|146 S. BENT STREET
10/30/2020|BIG HORN ENTERPRISE|641 WARREN STREET
11/5/2020|BROOKDALE |
10/29/2020|BROOKDALE|2401 COUGAR AVENUE
10/30/2020|ELMCROFT|1551 SUGARLAND DRIVE
11/2/2020|ELMCROFT|1551 SUGARLAND DRIVE DRIVE
11/21/2020|GREEN HOUSE LIVING|2311 SHIRLEY
10/29/2020|GREEN HOUSE LIVING|2311 SHIRLEY COVE
11/21/2020|MISSION AT THE VILLA|1445 UINTA
11/2/2020|MISSION AT THE VILLA|1445 UINTA DRIVE
;

proc sort data=to_clean; by unitName subDate; run;

data to_clean_2;
set to_clean;
by unitName;
length goodAddr $100;
retain goodAddr;
if first.unitName then goodAddr = addr1;
else if complev(trim(goodAddr), trim(addr1), "IL:") > 2 then goodAddr = addr1;
run;

proc sort data=to_clean_2; by unitName goodAddr subDate; run;

data want;
set to_clean_2;
by unitName goodAddr;
if last.goodAddr;
run;

proc print noobs data=want; run;

PGStats_0-1606281354131.png

 

PG

View solution in original post

4 REPLIES 4
mkeintz
PROC Star

And for your sample data set, what do you expect the desired results to look like?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Stalk
Pyrite | Level 9

Flag the previous record (old sub_date) as duplicate to delete it later.

Then my final data will be clean with one record per unit_name/addr1

PGStats
Opal | Level 21

You could use a spelling distance function such as COMPLEV:

 

data to_clean;
infile cards dlm='|' truncover ;
input subDate :mmddyy10. unitName :$100. ADDR1 :$100. ;
format subDate yymmdd10.;
cards;
11/21/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN
10/30/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN STREET
10/30/2020|BIG HORN ENTERPRISE|146 S. BENT STREET
10/30/2020|BIG HORN ENTERPRISE|641 WARREN STREET
11/5/2020|BROOKDALE |
10/29/2020|BROOKDALE|2401 COUGAR AVENUE
10/30/2020|ELMCROFT|1551 SUGARLAND DRIVE
11/2/2020|ELMCROFT|1551 SUGARLAND DRIVE DRIVE
11/21/2020|GREEN HOUSE LIVING|2311 SHIRLEY
10/29/2020|GREEN HOUSE LIVING|2311 SHIRLEY COVE
11/21/2020|MISSION AT THE VILLA|1445 UINTA
11/2/2020|MISSION AT THE VILLA|1445 UINTA DRIVE
;

proc sort data=to_clean; by unitName subDate; run;

data to_clean_2;
set to_clean;
by unitName;
length goodAddr $100;
retain goodAddr;
if first.unitName then goodAddr = addr1;
else if complev(trim(goodAddr), trim(addr1), "IL:") > 2 then goodAddr = addr1;
run;

proc sort data=to_clean_2; by unitName goodAddr subDate; run;

data want;
set to_clean_2;
by unitName goodAddr;
if last.goodAddr;
run;

proc print noobs data=want; run;

PGStats_0-1606281354131.png

 

PG
Stalk
Pyrite | Level 9

Thank you so much. First time hearing about complev function. Very helpful and the code works perfectly as desired.

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1164 views
  • 0 likes
  • 3 in conversation