Greetings Community,
I have this dataset:
Client_ID | Address |
ABC123 | 11 East Street, City1, OH |
ABC123 | 11 East Street, City1, OH County 5 |
XFE51D | 100 Main Street, City2, OH |
XFE51D | 100 Main Street, City2, OH |
I need to keep only ONE observation for each client with the following conditions:
- If the addresses are different, keep the longest address
- if the addresses are the same, keep only one.
My final dataset should look like this:
Client_ID | Address |
ABC123 | 11 East Street, City1, OH County 5 |
XFE51D | 100 Main Street, City2, OH |
Thank you!
If you want to use LENGTH as a criteria you need to add it to your dataset.
data want ;
set have;
a_length=length(address);
run;
proc sort;
by client_id a_length;
run;
data want;
set want;
by client_id;
if last.client_id;
run;
data have;
input Client_ID $ Address $50.;
cards;
ABC123 11 East Street, City1, OH
ABC123 11 East Street, City1, OH County 5
XFE51D 100 Main Street, City2, OH
XFE51D 100 Main Street, City2, OH
;
proc sql;
create table want as
select distinct *
from have
group by client_id
having max(length(address))=length(address);
quit;
If you want to use LENGTH as a criteria you need to add it to your dataset.
data want ;
set have;
a_length=length(address);
run;
proc sort;
by client_id a_length;
run;
data want;
set want;
by client_id;
if last.client_id;
run;
There are some tricky situations to consider, both in the planning and execution. Not all solutions will return the same result for:
Client_ID Address
ABC123 11 East Street, City1, OH
ABC123 11 East Street, City1, OH County 5
ABC123 11 East Street, City1, OH County 5
And what would you like the result to be in cases of a tie?
Client_ID Address
ABC123 11 East Street, City1, OH
ABC123 11 East Street, City1, OH County 4
ABC123 11 East Street, City1, OH County 5
@Astounding wrote:
There are some tricky situations to consider, both in the planning and execution. Not all solutions will return the same result for:
Client_ID Address ABC123 11 East Street, City1, OH ABC123 11 East Street, City1, OH County 5 ABC123 11 East Street, City1, OH County 5
And what would you like the result to be in cases of a tie?
Client_ID Address ABC123 11 East Street, City1, OH ABC123 11 East Street, City1, OH County 4 ABC123 11 East Street, City1, OH County 5
Or even more fun
Client_ID Address ABC123 11 East Street, City1, OH County 4 ABC123 22 West Street, Abcde, OH OCount 9
Ballardw
Thank you for bringing that up. All these combinations are possible. I have an additional column in the data that has the Date of that record. In case of any of these possibilities, I need to pick up the latest address (based on the most recent Date).
So the data looks like this:
Client_ID Address Date
ABC123 11 East Street, City1, OH 05/01/2018
ABC123 11 East Street, City1, OH County 5 09/09/2018
ABC123 11 East Street, City1, OH County 5 01/13/2019
How the code will look like now?
Thank you
@altijani wrote:
Ballardw
Thank you for bringing that up. All these combinations are possible. I have an additional column in the data that has the Date of that record. In case of any of these possibilities, I need to pick up the latest address (based on the most recent Date).
So the data looks like this:
Client_ID Address Date
ABC123 11 East Street, City1, OH 05/01/2018
ABC123 11 East Street, City1, OH County 5 09/09/2018
ABC123 11 East Street, City1, OH County 5 01/13/2019
How the code will look like now?
Thank you
When I have dated addresses I would just take the latest. Unless you have lots of duplicate dates with different addresses then this should work if your date variable is an actual SAS date value (numeric with a SAS format like mmddyy10.)
Proc sort data=have; by client_id date; run; data want; set have; by client_id; if last.client_id; run;
It sounds like you are looking for the longest value, but using DATE to break ties. I have to assume that you know what a SAS date actually should contain, and that your data actually contains true SAS dates. In that case, I would choose a variation on a method that was suggested earlier:
data want ;
set have;
a_length=length(address);
run;
proc sort;
by client_id a_length date;
run;
data want;
set want;
by client_id;
if last.client_id;
drop a_length;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.