Solved: Re: flagging same dates across different date var

Mscarboncopy · Posted 06-22-2022 09:43 AM

I have a file with 20 different date vars and I would need to remove any duplicate entry based on the duplicate date per ID. I'm using 5 IDS and 4 dates for example. I want to flag the duplicates per ID (in bold here) - so I can remove them. I can't remove them automatically I would have to see what is in the rest of data to make a decision (case by case) The duplicates are usually side by side but it could be anywhere in the file.

Data example;
input id DATE1 DATE2 DATE3 DATE4;
cards;
1 10/20/1999 10/20/1999 11/30/1999 12/05/1999
2 03/30/2006 05/05/2006 10/08/2006 12/12/2006
3 12/08/1998 02/02/1999 04/05/1999 05/05/19999
4 01/01/2001 02/03/2001 01/01/2001 03/30/2001
5 12/19/2003 01/08/2004 01/08/2004 06/10/2004
;
Run;

Thanks !

ballardw · Posted 06-22-2022 11:44 AM

@Mscarboncopy wrote:

My dates are date vars not characters. My input had issue apologies. Yes, I meant the way you redid it.

I would like to remove the dates that are duplicates for the same record or ID. I will have to write a code that does it case by case. This is not a common issue in my data file but I have 600 records (Ids) . If I created a code to give me the flags it would work perfectly.

Thank you.

You need to provide at least one example of the ID with the duplicated dates from different records.

Then show what you expect to see for a result.

If you don't want us to remove them then we need to see what you are expecting us to provide.

Here is one way to get ID with one(or more dates) duplicated across all variables and multiple observations. Note the addition of another ID=3 record to create a duplicate in another observation.

Data example;
   informat id f4. date1-date4 mmddyy10.;
   format date1-date4 mmddyy10.;
   input id DATE1 DATE2 DATE3 DATE4;
cards;
1  10/20/1999  10/20/1999 11/30/1999 12/05/1999
2  03/30/2006   05/05/2006  10/08/2006  12/12/2006
3 12/08/1998   02/02/1999 04/05/1999  05/05/1999
4  01/01/2001  02/03/2001  01/01/2001  03/30/2001
5 12/19/2003   01/08/2004   01/08/2004  06/10/2004
3 12/09/1998   02/02/1999 04/07/1999  05/07/1999
;

data need;
   set example;
   array d (*) date: ;
   do i=1 to dim(d);
      date= d[i];
      output;
   end;
   format date mmddyy10.;
   keep id date;
run;

proc summary data=need nway;
   class id date;
   output out=want (drop=_type_ where=(_freq_>1));
run;

The result data set Want has the Id and duplicated date(s) and a variable _freq_ that indicates how many times that date appears.

If you have LOTS of duplicates that result data set may need to be filtered. But since you have not provided anything resembling what the output should look like this is my best guess.

View solution in original post

PaigeMiller · Posted 06-22-2022 10:05 AM

Please show us the desired output.

When you say "remove them" do you mean remove the entire record, or remove the duplicate date, or something else?

--
Paige Miller

Mscarboncopy · Posted 06-22-2022 11:07 AM

1  10/20/1999  10/20/1999 11/30/1999 12/05/1999
2  03/30/2006   05/05/2006  10/08/2006  12/12/2006
3 12/08/1998   02/02/1999 04/05/1999  05/05/1999
4  01/01/2001  02/03/2001  01/01/2001  03/30/2001
5 12/19/2003   01/08/2004   01/08/2004  06/10/2004

remove the date but I don't need a code to remove them I need a code that will tell me which Ids have duplicated dates. ID 1 here has 2 duplicates 10/20/1999 and ID4 has 2 duplicates 01/01/2001 ID 5 has two duplicates 01/08/2004 I would have to look at the rest of the data (I have several vars linked to the dates) to decide which one to keep. I would delete one of the duplicate date for those IDs and all the other vars associated with it. My desired output is the same as above without the duplicated dates for each of those 3 IDs. Thanks

PaigeMiller · Posted 06-22-2022 11:36 AM

My desired output is the same as above without the duplicated dates for each of those 3 IDs. Thanks

I am asking you to SHOW US what this looks like.

Also, are the duplicated dates always next to each other in the real data?

--
Paige Miller

Mscarboncopy · Posted 06-22-2022 12:01 PM

not necessarily next to each other, most likely yes, but not always.

Hoping the caps does not mean you are yelling at me 🙂

1  10/20/1999       .        11/30/1999 12/05/1999
2  03/30/2006   05/05/2006  10/08/2006  12/12/2006
3 12/08/1998   02/02/1999    04/05/1999  05/05/1999
4  01/01/2001  02/03/2001      .         03/30/2001
5 12/19/2003   01/08/2004       .          06/10/2004

However, I don't need a code to give me this output, I need a code to give me flags, as I cannot automatically remove them, not sure how to show flags needed but they would be showing up for those IDS above for the DATE vars where one is missing. Thanks again.

ballardw · Posted 06-22-2022 10:27 AM

If you really mean "could be anywhere in the file" then you should show at least one example where the duplicate is not on the same record. All of your examples shown are on the same record and if that is the case then it is very likely that they can be "automatically" removed once you show what the expected output would be.

Note: your data step as shown throws invalid data errors for reading every single one of those date values (not to mention one year of 19999 which is valid but there currently aren't any date formats that will display the year correctly):

1    Data example;
2    input id DATE1 DATE2 DATE3 DATE4;
3    cards;

NOTE: Invalid data for DATE1 in line 4 4-13.
NOTE: Invalid data for DATE2 in line 4 16-25.
NOTE: Invalid data for DATE3 in line 4 27-36.
NOTE: Invalid data for DATE4 in line 4 38-47.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--
4          1  10/20/1999  10/20/1999 11/30/1999 12/05/1999
id=1 DATE1=. DATE2=. DATE3=. DATE4=. _ERROR_=1 _N_=1
NOTE: Invalid data for DATE1 in line 5 4-13.
NOTE: Invalid data for DATE2 in line 5 17-26.
NOTE: Invalid data for DATE3 in line 5 29-38.
NOTE: Invalid data for DATE4 in line 5 41-50.
5          2  03/30/2006   05/05/2006  10/08/2006  12/12/2006
id=2 DATE1=. DATE2=. DATE3=. DATE4=. _ERROR_=1 _N_=2

Which may bring up a question of are your values actually date values or character? Or did you mean something more like:

Data example;
   informat id f4. date1-date4 mmddyy10.;
   format date1-date4 mmddyy10.;
   input id DATE1 DATE2 DATE3 DATE4;
cards;
1  10/20/1999  10/20/1999 11/30/1999 12/05/1999
2  03/30/2006   05/05/2006  10/08/2006  12/12/2006
3 12/08/1998   02/02/1999 04/05/1999  05/05/1999
4  01/01/2001  02/03/2001  01/01/2001  03/30/2001
5 12/19/2003   01/08/2004   01/08/2004  06/10/2004
;

Mscarboncopy · Posted 06-22-2022 10:56 AM

My dates are date vars not characters. My input had issue apologies. Yes, I meant the way you redid it.

I would like to remove the dates that are duplicates for the same record or ID. I will have to write a code that does it case by case. This is not a common issue in my data file but I have 600 records (Ids) . If I created a code to give me the flags it would work perfectly.

Thank you.

ballardw · Posted 06-22-2022 11:44 AM

@Mscarboncopy wrote:

My dates are date vars not characters. My input had issue apologies. Yes, I meant the way you redid it.

I would like to remove the dates that are duplicates for the same record or ID. I will have to write a code that does it case by case. This is not a common issue in my data file but I have 600 records (Ids) . If I created a code to give me the flags it would work perfectly.

Thank you.

You need to provide at least one example of the ID with the duplicated dates from different records.

Then show what you expect to see for a result.

If you don't want us to remove them then we need to see what you are expecting us to provide.

Here is one way to get ID with one(or more dates) duplicated across all variables and multiple observations. Note the addition of another ID=3 record to create a duplicate in another observation.

Data example;
   informat id f4. date1-date4 mmddyy10.;
   format date1-date4 mmddyy10.;
   input id DATE1 DATE2 DATE3 DATE4;
cards;
1  10/20/1999  10/20/1999 11/30/1999 12/05/1999
2  03/30/2006   05/05/2006  10/08/2006  12/12/2006
3 12/08/1998   02/02/1999 04/05/1999  05/05/1999
4  01/01/2001  02/03/2001  01/01/2001  03/30/2001
5 12/19/2003   01/08/2004   01/08/2004  06/10/2004
3 12/09/1998   02/02/1999 04/07/1999  05/07/1999
;

data need;
   set example;
   array d (*) date: ;
   do i=1 to dim(d);
      date= d[i];
      output;
   end;
   format date mmddyy10.;
   keep id date;
run;

proc summary data=need nway;
   class id date;
   output out=want (drop=_type_ where=(_freq_>1));
run;

The result data set Want has the Id and duplicated date(s) and a variable _freq_ that indicates how many times that date appears.

If you have LOTS of duplicates that result data set may need to be filtered. But since you have not provided anything resembling what the output should look like this is my best guess.

Mscarboncopy · Posted 06-22-2022 12:07 PM

Thank you. This is what I had in mind. Will try it out. Difficult to show an output in this case. How would someone show an output for flagging ? My desired output is not the removal of the duplicate dates, it is simply flagging them.

Mscarboncopy · Posted 06-22-2022 12:13 PM

This gave me exactly what I needed. I appreciate the patience.

Tom · Posted 06-22-2022 12:15 PM

It will be MUCH easier if there is only one DATE variable.

data example;
  input id (DATE1-DATE4) (:mmddyy.);
  format date1-date4 yymmdd10.;
cards;
1 10/20/1999  10/20/1999 11/30/1999 12/05/1999
2 03/30/2006  05/05/2006 10/08/2006 12/12/2006
3 12/08/1998  02/02/1999 04/05/1999 05/05/1999
4 01/01/2001  02/03/2001 01/01/2001 03/30/2001
5 12/19/2003  01/08/2004 01/08/2004 06/10/2004
;

proc transpose data=example out=tall(rename=(col1=date));
  by id;
  var date1-date4 ;
run;

proc sql;
create table want as 
  select a.*, b.flag 
  from example a
  left join 
    ( select id,max(count)>1 as FLAG 
      from (select id,date,count(*) as count 
            from tall 
            where not missing(date)
            group by id,date
            ) c
      group by id
    ) b
  on a.id = b.id
;
quit;
proc print;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!