Solved: Re: deleting duplicate ob in consecutive date

EM_G · Posted 04-27-2021 12:47 PM

Hi All,

I have a data set where every observation is repeated twice on consecutive days and I would like to delete the first observation. For example:

id group treatmentdate day month year

A1 0 30Sep2017 30 9 2017

A1 0 1Oct2017 1 10 2017

A2 1 06Nov2017 6 11 2017

A2 1 07Nov2017 7 11 2017

A1 0 23Oct2017 23 10 2017

A1 0 24Oct2017 24 10 2017

What I would like:

id group treatmentdate day month year

A1 0 01Oct2017 1 10 2017

A2 1 07Nov2017 7 11 2017

A1 0 24Oct2017 24 10 2017

I was attempting to use

proc sql;

create table want as

select *

from have

group by id group

having day=max(day)

run;

but this will not work as I have nothing more to group by to retain the same id from the same group with two different treatments (of two consecutive days).

Any suggestions please?

I have looked at using a DoW loop but haven't been successful.

Patrick · Posted 04-29-2021 03:51 AM

If I understand what you describe and want then code like below should do the job.

data have;
  input (id group) ($) treatmentdate:date9. day month year;
  format treatmentdate date9.;
  datalines;
A1 0 30Sep2017 30 9 2017
A1 0 1Oct2017 1 10 2017
A2 1 06Nov2017 6 11 2017
A2 1 07Nov2017 7 11 2017
A1 0 23Oct2017 23 10 2017
A1 0 24Oct2017 24 10 2017
;

proc sort data=have out=inter;
  by id group DESCENDING treatmentdate;
run;

data want;
  set inter;
  by id group DESCENDING treatmentdate;
  if dif1(treatmentdate)=-1 and not first.group then delete;
run;

NB: When using the DIF or LAG function make sure you use it in a place where it gets executed for every single row of your input data.

View solution in original post

PaigeMiller · Posted 04-27-2021 12:51 PM

Given the condition that

every observation is repeated twice on consecutive days and I would like to delete the first observation

here's the solution:

data want;
    set have;
    if mod(_n_,2)=0;
run;

--
Paige Miller

AMSAS · Posted 04-27-2021 12:56 PM

Take a look at Grouping Data and FIRST. and LAST. DATA Step Variables
That should do what you want

Reeza · Posted 04-27-2021 01:12 PM

Every record will always have two entries? It also looks like you may have multiple entries for a single ID as A1 ID is repeated twice?

data want;
set have;
by ID group NOTSORTED;
if first.group;
run;

@EM_G wrote:

Hi All,

I have a data set where every observation is repeated twice on consecutive days and I would like to delete the first observation. For example:

id group treatmentdate day month year

A1 0 30Sep2017 30 9 2017

A1 0 1Oct2017 1 10 2017

A2 1 06Nov2017 6 11 2017

A2 1 07Nov2017 7 11 2017

A1 0 23Oct2017 23 10 2017

A1 0 24Oct2017 24 10 2017

What I would like:

id group treatmentdate day month year

A1 0 01Oct2017 1 10 2017

A2 1 07Nov2017 7 11 2017

A1 0 24Oct2017 24 10 2017

I was attempting to use

proc sql;

create table want as

select *

from have

group by id group

having day=max(day)

run;

but this will not work as I have nothing more to group by to retain the same id from the same group with two different treatments (of two consecutive days).

Any suggestions please?

I have looked at using a DoW loop but haven't been successful.

EM_G · Posted 04-29-2021 02:54 AM

Hi Reeza

Not every record has two entries and yes there are some multiple entries for id within a group and id may appear in multiple groups.

Kurt_Bremser · Posted 04-29-2021 03:11 AM

What if you have more than two entries for a given id/group combination? What if the dates are not consecutive? What if there's multiple obs for a single date?

Before we can come up with a reliable solution, this (and probably other) facts about your data need to be cleared up.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

EM_G · Posted 04-29-2021 03:28 AM

Facts about the data:
Every treatment occurs for two consecutive days (apart from <5% which is more than likely data input error. I will look at removing these entries).
There are some id/group combination that occur multiple times (The two day treatment is consecutive, but the repeated treatment is a random date).
There are multiple id's with the same and different id/group combinations on the same date.
Most id's progress through ascending groups.
Data is over 3 years and approx 50,000 obs (including duplicates on consecutive day).

Can I some how make a new variable that contains the second treatment only, then remove the initial variable containing the two?
I was thinking using if then statement, but unsure how to narrow it down to individual id/group combination.

All help is much appreciated.

Cheers
Em

Patrick · Posted 04-29-2021 03:51 AM

If I understand what you describe and want then code like below should do the job.

data have;
  input (id group) ($) treatmentdate:date9. day month year;
  format treatmentdate date9.;
  datalines;
A1 0 30Sep2017 30 9 2017
A1 0 1Oct2017 1 10 2017
A2 1 06Nov2017 6 11 2017
A2 1 07Nov2017 7 11 2017
A1 0 23Oct2017 23 10 2017
A1 0 24Oct2017 24 10 2017
;

proc sort data=have out=inter;
  by id group DESCENDING treatmentdate;
run;

data want;
  set inter;
  by id group DESCENDING treatmentdate;
  if dif1(treatmentdate)=-1 and not first.group then delete;
run;

NB: When using the DIF or LAG function make sure you use it in a place where it gets executed for every single row of your input data.

EM_G · Posted 04-30-2021 08:55 PM

Patrick,

Thank you so much, this has worked. Would you mind explaining the 'not first.group'? I have read about this in the sas documents but have not got my head around it. What did it prevent from happening?

Thanks again for your time.

Patrick · Posted 05-01-2021 12:04 AM

@EM_G wrote:
Patrick,

Thank you so much, this has worked. Would you mind explaining the 'not first.group'? I have read about this in the sas documents but have not got my head around it. What did it prevent from happening?

Thanks again for your time.

The first and last syntax in a data step allows you to identify the first or last row in a tuple.

You only want to "de-dupe" within the same group for an id.

The syntax used compares the current treatment date to the treatment date of the previous obs also when this previous obs belonged to a different group. Nothing prevents this previous treatment date value to be by chance exactly one day earlier than the current one in the first obs of a new group . Using and not first.group ensures that the first obs of a new group always gets kept even if the treatment date comparison would return a -1.

Btw: Expression dif1(treatmentdate)=-1 must be the first element in the if condition to ensure it gets executed for every single row. If a logical expression gets evaluated to False then SAS won't necessarily execute later expression under the same if statement (i.e. if with two expressions connected via AND the first expression resolved to False then there is no need to even execute the second test).

EM_G · Posted 05-02-2021 09:23 PM

Thank you for the explanation and your time.

Registration is open

SAS Training: Just a Click Away