BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
EM_G
Fluorite | Level 6

Hi All,

 

I have a data set where every observation is repeated twice on consecutive days and I would like to delete the first observation. For example:

 

id          group             treatmentdate          day            month          year              

A1            0                 30Sep2017               30               9                  2017

A1            0                  1Oct2017                   1               10                 2017

A2            1                  06Nov2017                6                 11               2017

A2            1                   07Nov2017               7                  11              2017

A1            0                   23Oct2017                23                10              2017

A1             0                   24Oct2017               24               10                2017

 

What I would like:

id             group                 treatmentdate          day          month           year 

A1              0                      01Oct2017                 1              10                 2017

A2               1                   07Nov2017                   7               11                 2017

A1               0                     24Oct2017                  24              10               2017

 

 

I was attempting to use

proc sql;

create table want as 

select *

from have 

group by id group

having day=max(day)

run;

but this will not work as I have nothing more to group by to retain the same id from the same group with two different treatments (of two consecutive days). 

Any suggestions please? 

 

I have looked at using a DoW loop but haven't been successful. 

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

If I understand what you describe and want then code like below should do the job.

data have;
  input (id group) ($) treatmentdate:date9. day month year;
  format treatmentdate date9.;
  datalines;
A1 0 30Sep2017 30 9 2017
A1 0 1Oct2017 1 10 2017
A2 1 06Nov2017 6 11 2017
A2 1 07Nov2017 7 11 2017
A1 0 23Oct2017 23 10 2017
A1 0 24Oct2017 24 10 2017
;

proc sort data=have out=inter;
  by id group DESCENDING treatmentdate;
run;

data want;
  set inter;
  by id group DESCENDING treatmentdate;
  if dif1(treatmentdate)=-1 and not first.group then delete;
run;

 NB: When using the DIF or LAG function make sure you use it in a place where it gets executed for every single row of your input data.

View solution in original post

10 REPLIES 10
PaigeMiller
Diamond | Level 26

Given the condition that 

 

every observation is repeated twice on consecutive days and I would like to delete the first observation

 

here's the solution:

 

data want;
    set have;
    if mod(_n_,2)=0;
run;

 

 

--
Paige Miller
AMSAS
SAS Super FREQ

Take a look at Grouping Data and FIRST. and LAST. DATA Step Variables 
That should do what you want

Reeza
Super User

Every record will always have two entries? It also looks like you may have multiple entries for a single ID as A1 ID is repeated twice?

 

data want;
set have;
by ID group NOTSORTED;
if first.group;
run;

@EM_G wrote:

Hi All,

 

I have a data set where every observation is repeated twice on consecutive days and I would like to delete the first observation. For example:

 

id          group             treatmentdate          day            month          year              

A1            0                 30Sep2017               30               9                  2017

A1            0                  1Oct2017                   1               10                 2017

A2            1                  06Nov2017                6                 11               2017

A2            1                   07Nov2017               7                  11              2017

A1            0                   23Oct2017                23                10              2017

A1             0                   24Oct2017               24               10                2017

 

What I would like:

id             group                 treatmentdate          day          month           year 

A1              0                      01Oct2017                 1              10                 2017

A2               1                   07Nov2017                   7               11                 2017

A1               0                     24Oct2017                  24              10               2017

 

 

I was attempting to use

proc sql;

create table want as 

select *

from have 

group by id group

having day=max(day)

run;

but this will not work as I have nothing more to group by to retain the same id from the same group with two different treatments (of two consecutive days). 

Any suggestions please? 

 

I have looked at using a DoW loop but haven't been successful. 


 

EM_G
Fluorite | Level 6
Hi Reeza

Not every record has two entries and yes there are some multiple entries for id within a group and id may appear in multiple groups.
Kurt_Bremser
Super User

What if you have more than two entries for a given id/group combination? What if the dates are not consecutive? What if there's multiple obs for a single date?

 

Before we can come up with a reliable solution, this (and probably other) facts about your data need to be cleared up.

EM_G
Fluorite | Level 6
Facts about the data:
Every treatment occurs for two consecutive days (apart from <5% which is more than likely data input error. I will look at removing these entries).
There are some id/group combination that occur multiple times (The two day treatment is consecutive, but the repeated treatment is a random date).
There are multiple id's with the same and different id/group combinations on the same date.
Most id's progress through ascending groups.
Data is over 3 years and approx 50,000 obs (including duplicates on consecutive day).

Can I some how make a new variable that contains the second treatment only, then remove the initial variable containing the two?
I was thinking using if then statement, but unsure how to narrow it down to individual id/group combination.

All help is much appreciated.

Cheers
Em



Patrick
Opal | Level 21

If I understand what you describe and want then code like below should do the job.

data have;
  input (id group) ($) treatmentdate:date9. day month year;
  format treatmentdate date9.;
  datalines;
A1 0 30Sep2017 30 9 2017
A1 0 1Oct2017 1 10 2017
A2 1 06Nov2017 6 11 2017
A2 1 07Nov2017 7 11 2017
A1 0 23Oct2017 23 10 2017
A1 0 24Oct2017 24 10 2017
;

proc sort data=have out=inter;
  by id group DESCENDING treatmentdate;
run;

data want;
  set inter;
  by id group DESCENDING treatmentdate;
  if dif1(treatmentdate)=-1 and not first.group then delete;
run;

 NB: When using the DIF or LAG function make sure you use it in a place where it gets executed for every single row of your input data.

EM_G
Fluorite | Level 6
Patrick,

Thank you so much, this has worked. Would you mind explaining the 'not first.group'? I have read about this in the sas documents but have not got my head around it. What did it prevent from happening?

Thanks again for your time.
Patrick
Opal | Level 21

@EM_G wrote:
Patrick,

Thank you so much, this has worked. Would you mind explaining the 'not first.group'? I have read about this in the sas documents but have not got my head around it. What did it prevent from happening?

Thanks again for your time.

The first and last syntax in a data step allows you to identify the first or last row in a tuple. 

You only want to "de-dupe" within the same group for an id.

The syntax used compares the current treatment date to the treatment date of the previous obs also when this previous obs belonged to a different group. Nothing prevents this previous treatment date value to be by chance exactly one day earlier than the current one in the first obs of a new group . Using and not first.group ensures that the first obs of a new group always gets kept even if the treatment date comparison would return a -1.

 

Btw: Expression dif1(treatmentdate)=-1 must be the first element in the if condition to ensure it gets executed for every single row. If a logical expression gets evaluated to False then SAS won't necessarily execute later expression under the same if statement (i.e. if with two expressions connected via AND the first expression resolved to False then there is no need to even execute the second test).

EM_G
Fluorite | Level 6

Thank you for the explanation and your time.  

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1015 views
  • 1 like
  • 6 in conversation