BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Stalk
Pyrite | Level 9

All,

I have a dataset with thousands of rows and about 45 columns and I need to clean up the duplicates.I was able to identify dedups but the challenging part is ADDR1 field. I want to compare the first few characters of the ADDR1 field from the previous sub_date to recent sub_date, and if they are closely matching(at least 90%) then I want to retain the previous ADDR1 value to the most recent record. When the ADDR1 field match then I would like to delete the old sub_date record. Is this possible?

data to_clean ;

length sub_date $10. Unit_Name $100 ADDR1 $100;

infile cards dsd dlm='|' truncover ;

input sub_date -- ADDR1 ;

cards;

 

11/21/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN

10/30/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN STREET

10/30/2020|BIG HORN ENTERPRISE|146 S. BENT STREET

10/30/2020|BIG HORN ENTERPRISE|641 WARREN STREET

11/5/2020|BROOKDALE |

10/29/2020|BROOKDALE|2401 COUGAR AVENUE

10/30/2020|ELMCROFT|1551 SUGARLAND DRIVE

11/2/2020|ELMCROFT|1551 SUGARLAND DRIVE DRIVE

11/21/2020|GREEN HOUSE LIVING|2311 SHIRLEY

10/29/2020|GREEN HOUSE LIVING|2311 SHIRLEY COVE

11/21/2020|MISSION AT THE VILLA|1445 UINTA

11/2/2020|MISSION AT THE VILLA|1445 UINTA DRIVE

;

run;

 

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

You could use a spelling distance function such as COMPLEV:

 

data to_clean;
infile cards dlm='|' truncover ;
input subDate :mmddyy10. unitName :$100. ADDR1 :$100. ;
format subDate yymmdd10.;
cards;
11/21/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN
10/30/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN STREET
10/30/2020|BIG HORN ENTERPRISE|146 S. BENT STREET
10/30/2020|BIG HORN ENTERPRISE|641 WARREN STREET
11/5/2020|BROOKDALE |
10/29/2020|BROOKDALE|2401 COUGAR AVENUE
10/30/2020|ELMCROFT|1551 SUGARLAND DRIVE
11/2/2020|ELMCROFT|1551 SUGARLAND DRIVE DRIVE
11/21/2020|GREEN HOUSE LIVING|2311 SHIRLEY
10/29/2020|GREEN HOUSE LIVING|2311 SHIRLEY COVE
11/21/2020|MISSION AT THE VILLA|1445 UINTA
11/2/2020|MISSION AT THE VILLA|1445 UINTA DRIVE
;

proc sort data=to_clean; by unitName subDate; run;

data to_clean_2;
set to_clean;
by unitName;
length goodAddr $100;
retain goodAddr;
if first.unitName then goodAddr = addr1;
else if complev(trim(goodAddr), trim(addr1), "IL:") > 2 then goodAddr = addr1;
run;

proc sort data=to_clean_2; by unitName goodAddr subDate; run;

data want;
set to_clean_2;
by unitName goodAddr;
if last.goodAddr;
run;

proc print noobs data=want; run;

PGStats_0-1606281354131.png

 

PG

View solution in original post

4 REPLIES 4
mkeintz
PROC Star

And for your sample data set, what do you expect the desired results to look like?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Stalk
Pyrite | Level 9

Flag the previous record (old sub_date) as duplicate to delete it later.

Then my final data will be clean with one record per unit_name/addr1

PGStats
Opal | Level 21

You could use a spelling distance function such as COMPLEV:

 

data to_clean;
infile cards dlm='|' truncover ;
input subDate :mmddyy10. unitName :$100. ADDR1 :$100. ;
format subDate yymmdd10.;
cards;
11/21/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN
10/30/2020|BAIRD HUDSON ENTERPRISES|106 E MAIN STREET
10/30/2020|BIG HORN ENTERPRISE|146 S. BENT STREET
10/30/2020|BIG HORN ENTERPRISE|641 WARREN STREET
11/5/2020|BROOKDALE |
10/29/2020|BROOKDALE|2401 COUGAR AVENUE
10/30/2020|ELMCROFT|1551 SUGARLAND DRIVE
11/2/2020|ELMCROFT|1551 SUGARLAND DRIVE DRIVE
11/21/2020|GREEN HOUSE LIVING|2311 SHIRLEY
10/29/2020|GREEN HOUSE LIVING|2311 SHIRLEY COVE
11/21/2020|MISSION AT THE VILLA|1445 UINTA
11/2/2020|MISSION AT THE VILLA|1445 UINTA DRIVE
;

proc sort data=to_clean; by unitName subDate; run;

data to_clean_2;
set to_clean;
by unitName;
length goodAddr $100;
retain goodAddr;
if first.unitName then goodAddr = addr1;
else if complev(trim(goodAddr), trim(addr1), "IL:") > 2 then goodAddr = addr1;
run;

proc sort data=to_clean_2; by unitName goodAddr subDate; run;

data want;
set to_clean_2;
by unitName goodAddr;
if last.goodAddr;
run;

proc print noobs data=want; run;

PGStats_0-1606281354131.png

 

PG
Stalk
Pyrite | Level 9

Thank you so much. First time hearing about complev function. Very helpful and the code works perfectly as desired.

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1033 views
  • 0 likes
  • 3 in conversation