BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Mscarboncopy
Pyrite | Level 9

I have a file with 20 different date vars and I would need to remove any duplicate entry based on  the duplicate date per ID. I'm using 5 IDS and 4 dates for example. I want to flag the duplicates per ID (in bold here) -  so I can remove them. I can't remove them automatically I would have to see what is in the rest of data to make a decision (case by case) The  duplicates are usually side by side but it could be anywhere in the file.

Data example;
input id DATE1 DATE2 DATE3 DATE4;
cards;
10/20/1999  10/20/1999 11/30/1999 12/05/1999
2  03/30/2006   05/05/2006  10/08/2006  12/12/2006
3 12/08/1998   02/02/1999 04/05/1999  05/05/19999
01/01/2001  02/03/2001  01/01/2001  03/30/2001
5 12/19/2003   01/08/2004   01/08/2004  06/10/2004
;
Run;

 

Thanks !

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

@Mscarboncopy wrote:

My dates are date vars not characters.  My input had issue apologies. Yes, I meant  the way  you redid it.

I would like to remove the dates that are duplicates for the same record or ID. I  will have to write a code that does it case by case. This is not a common issue in my data file but I have 600 records (Ids) . If I created a code to give me the flags it would work perfectly.

Thank you.


You need to provide at least one example of the ID with the duplicated dates from different records.

Then show what you expect to see for a result.

If you don't want us to remove them then we need to see what you are expecting us to provide.

 

Here is one way to get ID with one(or more dates) duplicated across all variables and multiple observations. Note the addition of another ID=3 record to create a duplicate in another observation.

Data example;
   informat id f4. date1-date4 mmddyy10.;
   format date1-date4 mmddyy10.;
   input id DATE1 DATE2 DATE3 DATE4;
cards;
1  10/20/1999  10/20/1999 11/30/1999 12/05/1999
2  03/30/2006   05/05/2006  10/08/2006  12/12/2006
3 12/08/1998   02/02/1999 04/05/1999  05/05/1999
4  01/01/2001  02/03/2001  01/01/2001  03/30/2001
5 12/19/2003   01/08/2004   01/08/2004  06/10/2004
3 12/09/1998   02/02/1999 04/07/1999  05/07/1999
;

data need;
   set example;
   array d (*) date: ;
   do i=1 to dim(d);
      date= d[i];
      output;
   end;
   format date mmddyy10.;
   keep id date;
run;

proc summary data=need nway;
   class id date;
   output out=want (drop=_type_ where=(_freq_>1));
run;

The result data set Want has the Id and duplicated date(s) and a variable _freq_ that indicates how many times that date appears.

 

If you have LOTS of duplicates that result data set may need to be filtered. But since you have not provided anything resembling what the output should look like this is my best guess.

 

View solution in original post

10 REPLIES 10
PaigeMiller
Diamond | Level 26

Please show us the desired output.

 

When you say "remove them" do you mean remove the entire record, or remove the duplicate date, or something else?

--
Paige Miller
Mscarboncopy
Pyrite | Level 9

 

1  10/20/1999  10/20/1999 11/30/1999 12/05/1999
2  03/30/2006   05/05/2006  10/08/2006  12/12/2006
3 12/08/1998   02/02/1999 04/05/1999  05/05/1999
4  01/01/2001  02/03/2001  01/01/2001  03/30/2001
5 12/19/2003   01/08/2004   01/08/2004  06/10/2004

remove the date  but I don't need a code to remove them I need a code that will tell me which Ids have duplicated dates. ID 1 here has 2 duplicates 10/20/1999 and ID4 has 2 duplicates 01/01/2001 ID 5 has two duplicates 01/08/2004 I would have to look at the rest of the data (I have several vars linked to the dates) to decide which one to keep. I would delete one of the duplicate date for those IDs and all the other vars associated with it. My desired output is the same as above without the duplicated dates for each of those 3  IDs. Thanks

PaigeMiller
Diamond | Level 26

My desired output is the same as above without the duplicated dates for each of those 3  IDs. Thanks

 

I am asking you to SHOW US what this looks like.


Also, are the duplicated dates always next to each other in the real data?

--
Paige Miller
Mscarboncopy
Pyrite | Level 9

not necessarily next to each other, most likely yes, but not always.

Hoping the caps does not mean you are yelling at me 🙂

1  10/20/1999       .        11/30/1999 12/05/1999
2  03/30/2006   05/05/2006  10/08/2006  12/12/2006
3 12/08/1998   02/02/1999    04/05/1999  05/05/1999
4  01/01/2001  02/03/2001      .         03/30/2001
5 12/19/2003   01/08/2004       .          06/10/2004

However, I don't need a code to give me this output, I need a code to give me flags, as I cannot automatically remove them,  not sure how to show flags needed but they would be showing up for those IDS above for the DATE vars where one is missing.  Thanks again.

ballardw
Super User

If you really mean "could be anywhere in the file" then you should show at least one example where the duplicate is not on the same record. All of your examples shown are on the same record and if that is the case then it is very likely that they can be "automatically" removed once you show what the expected output would be.

 

Note: your data step as shown throws invalid data errors for reading every single one of those date values (not to mention one year of 19999 which is valid but there currently aren't any date formats that will display the year correctly):

1    Data example;
2    input id DATE1 DATE2 DATE3 DATE4;
3    cards;

NOTE: Invalid data for DATE1 in line 4 4-13.
NOTE: Invalid data for DATE2 in line 4 16-25.
NOTE: Invalid data for DATE3 in line 4 27-36.
NOTE: Invalid data for DATE4 in line 4 38-47.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--
4          1  10/20/1999  10/20/1999 11/30/1999 12/05/1999
id=1 DATE1=. DATE2=. DATE3=. DATE4=. _ERROR_=1 _N_=1
NOTE: Invalid data for DATE1 in line 5 4-13.
NOTE: Invalid data for DATE2 in line 5 17-26.
NOTE: Invalid data for DATE3 in line 5 29-38.
NOTE: Invalid data for DATE4 in line 5 41-50.
5          2  03/30/2006   05/05/2006  10/08/2006  12/12/2006
id=2 DATE1=. DATE2=. DATE3=. DATE4=. _ERROR_=1 _N_=2

Which may bring up a question of are your values actually date values or character? Or did you mean something more like:

Data example;
   informat id f4. date1-date4 mmddyy10.;
   format date1-date4 mmddyy10.;
   input id DATE1 DATE2 DATE3 DATE4;
cards;
1  10/20/1999  10/20/1999 11/30/1999 12/05/1999
2  03/30/2006   05/05/2006  10/08/2006  12/12/2006
3 12/08/1998   02/02/1999 04/05/1999  05/05/1999
4  01/01/2001  02/03/2001  01/01/2001  03/30/2001
5 12/19/2003   01/08/2004   01/08/2004  06/10/2004
;
Mscarboncopy
Pyrite | Level 9

My dates are date vars not characters.  My input had issue apologies. Yes, I meant  the way  you redid it.

I would like to remove the dates that are duplicates for the same record or ID. I  will have to write a code that does it case by case. This is not a common issue in my data file but I have 600 records (Ids) . If I created a code to give me the flags it would work perfectly.

Thank you.

ballardw
Super User

@Mscarboncopy wrote:

My dates are date vars not characters.  My input had issue apologies. Yes, I meant  the way  you redid it.

I would like to remove the dates that are duplicates for the same record or ID. I  will have to write a code that does it case by case. This is not a common issue in my data file but I have 600 records (Ids) . If I created a code to give me the flags it would work perfectly.

Thank you.


You need to provide at least one example of the ID with the duplicated dates from different records.

Then show what you expect to see for a result.

If you don't want us to remove them then we need to see what you are expecting us to provide.

 

Here is one way to get ID with one(or more dates) duplicated across all variables and multiple observations. Note the addition of another ID=3 record to create a duplicate in another observation.

Data example;
   informat id f4. date1-date4 mmddyy10.;
   format date1-date4 mmddyy10.;
   input id DATE1 DATE2 DATE3 DATE4;
cards;
1  10/20/1999  10/20/1999 11/30/1999 12/05/1999
2  03/30/2006   05/05/2006  10/08/2006  12/12/2006
3 12/08/1998   02/02/1999 04/05/1999  05/05/1999
4  01/01/2001  02/03/2001  01/01/2001  03/30/2001
5 12/19/2003   01/08/2004   01/08/2004  06/10/2004
3 12/09/1998   02/02/1999 04/07/1999  05/07/1999
;

data need;
   set example;
   array d (*) date: ;
   do i=1 to dim(d);
      date= d[i];
      output;
   end;
   format date mmddyy10.;
   keep id date;
run;

proc summary data=need nway;
   class id date;
   output out=want (drop=_type_ where=(_freq_>1));
run;

The result data set Want has the Id and duplicated date(s) and a variable _freq_ that indicates how many times that date appears.

 

If you have LOTS of duplicates that result data set may need to be filtered. But since you have not provided anything resembling what the output should look like this is my best guess.

 

Mscarboncopy
Pyrite | Level 9

Thank you. This is what I had in mind. Will try it out. Difficult to show an output in this case. How would someone show an output for flagging ? My desired output is not the removal of the duplicate dates, it is simply flagging them.

 

Mscarboncopy
Pyrite | Level 9

This gave me exactly what I needed. I appreciate the patience.

Tom
Super User Tom
Super User

It will be MUCH easier if there is only one DATE variable.

data example;
  input id (DATE1-DATE4) (:mmddyy.);
  format date1-date4 yymmdd10.;
cards;
1 10/20/1999  10/20/1999 11/30/1999 12/05/1999
2 03/30/2006  05/05/2006 10/08/2006 12/12/2006
3 12/08/1998  02/02/1999 04/05/1999 05/05/1999
4 01/01/2001  02/03/2001 01/01/2001 03/30/2001
5 12/19/2003  01/08/2004 01/08/2004 06/10/2004
;

proc transpose data=example out=tall(rename=(col1=date));
  by id;
  var date1-date4 ;
run;

proc sql;
create table want as 
  select a.*, b.flag 
  from example a
  left join 
    ( select id,max(count)>1 as FLAG 
      from (select id,date,count(*) as count 
            from tall 
            where not missing(date)
            group by id,date
            ) c
      group by id
    ) b
  on a.id = b.id
;
quit;
proc print;
run;

Tom_0-1655914530946.png

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 625 views
  • 0 likes
  • 4 in conversation