BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
altijani
Quartz | Level 8

Greetings Community,

 

I have this dataset:

 

Client_IDAddress
ABC12311 East Street, City1, OH
ABC12311 East Street, City1, OH County 5
XFE51D100 Main Street, City2, OH
XFE51D100 Main Street, City2, OH

 

I need to keep only ONE observation for each client with the following conditions:

- If the addresses are different, keep the longest address

- if the addresses are the same, keep only one.

 

My final dataset should look like this:

 

Client_IDAddress
ABC12311 East Street, City1, OH County 5
XFE51D100 Main Street, City2, OH

 

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

If you want to use LENGTH as a criteria you need to add it to your dataset.

data want ;
  set have;
  a_length=length(address);
run;

proc sort;
  by client_id a_length;
run;

data want;
  set want;
  by client_id;
  if last.client_id;
run;

View solution in original post

8 REPLIES 8
novinosrin
Tourmaline | Level 20

data have;
input Client_ID $	Address $50.;
cards;
ABC123	11 East Street, City1, OH
ABC123	11 East Street, City1, OH County 5
XFE51D	100 Main Street, City2, OH
XFE51D	100 Main Street, City2, OH
;

proc sql;
create table want  as
select distinct *
from have
group by client_id
having max(length(address))=length(address);
quit;
 
Tom
Super User Tom
Super User

If you want to use LENGTH as a criteria you need to add it to your dataset.

data want ;
  set have;
  a_length=length(address);
run;

proc sort;
  by client_id a_length;
run;

data want;
  set want;
  by client_id;
  if last.client_id;
run;
Astounding
PROC Star

There are some tricky situations to consider, both in the planning and execution.  Not all solutions will return the same result for:

 

Client_ID	Address
ABC123	11 East Street, City1, OH
ABC123	11 East Street, City1, OH County 5
ABC123	11 East Street, City1, OH County 5

And what would you like the result to be in cases of a tie?

Client_ID	Address
ABC123	11 East Street, City1, OH
ABC123	11 East Street, City1, OH County 4
ABC123	11 East Street, City1, OH County 5
ballardw
Super User

@Astounding wrote:

There are some tricky situations to consider, both in the planning and execution.  Not all solutions will return the same result for:

 

Client_ID	Address
ABC123	11 East Street, City1, OH
ABC123	11 East Street, City1, OH County 5
ABC123	11 East Street, City1, OH County 5

And what would you like the result to be in cases of a tie?

Client_ID	Address
ABC123	11 East Street, City1, OH
ABC123	11 East Street, City1, OH County 4
ABC123	11 East Street, City1, OH County 5

Or even more fun

Client_ID Address 
ABC123 11 East Street, City1, OH County 4 
ABC123 22 West Street, Abcde, OH OCount 9
altijani
Quartz | Level 8

Ballardw

 

Thank you for bringing that up. All these combinations are possible. I have an additional column in the data that has the Date of that record. In case of any of these possibilities, I need to pick up the latest address (based on the most recent Date).

 

So the data looks like this:

 

Client_ID Address  Date
ABC123 11 East Street, City1, OH  05/01/2018
ABC123 11 East Street, City1, OH County 5 09/09/2018
ABC123 11 East Street, City1, OH County 5 01/13/2019

 

How the code will look like now?

 

Thank you

ballardw
Super User

@altijani wrote:

Ballardw

 

Thank you for bringing that up. All these combinations are possible. I have an additional column in the data that has the Date of that record. In case of any of these possibilities, I need to pick up the latest address (based on the most recent Date).

 

So the data looks like this:

 

Client_ID Address  Date
ABC123 11 East Street, City1, OH  05/01/2018
ABC123 11 East Street, City1, OH County 5 09/09/2018
ABC123 11 East Street, City1, OH County 5 01/13/2019

 

How the code will look like now?

 

Thank you


When I have dated addresses I would just take the latest. Unless you have lots of duplicate dates with different addresses then this should work if your date variable is an actual SAS date value (numeric with a SAS format like mmddyy10.)

Proc sort data=have;
   by client_id date;
run;

data want;
   set have;
   by client_id;
   if last.client_id;
run;
Astounding
PROC Star

It sounds like you are looking for the longest value, but using DATE to break ties.  I have to assume that you know what a SAS date actually should contain, and that your data actually contains true SAS dates.  In that case, I would choose a variation on a method that was suggested earlier:

 

data want ;
  set have;
  a_length=length(address);
run;

proc sort;
  by client_id a_length date;
run;

data want;
  set want;
  by client_id;
  if last.client_id;
drop a_length; run;
altijani
Quartz | Level 8
Thank you

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1009 views
  • 7 likes
  • 5 in conversation