Iterations through observations with duplicates ID

Accepted Solution Solved
Reply
Contributor
Posts: 23
Accepted Solution

Iterations through observations with duplicates ID

Hi everyone,

I have a dataset with the following format :

ID      StartDate     EndDate

1          1991          1992

1          1993          2005

2          1996          2013

3          1995          1998

3          1998          2002

3          2005          2207

...

I would like to do a loop through all observations, precisely through each ID's observations. In each iteration (assuming for the same ID), I want to check if the StartDate is equal to the EndDate of the previous period (previous observation). If yes, replace the EndDate on the previous observation by the the actual EndDate and then delete the row.

For example : for ID = 3, EndDate1 = StartDate2 = 1998 => EndDate1 = EndDate2 = 2002

ID      StartDate     EndDate

1          1991          1992

1          1993          2005

2          1996          2013

3          1995          2002

3          2005          2207

Finally, there also may happen that there are 3 or more consecutive periods with this problem (StartDate_{N} = EndDate_{N-1}), but for now I only consider the problem with only 2 consecutive periods. If someone can suggest a solution, it would be greatly appreciated !!!

Thanks


Accepted Solutions
Solution
‎05-13-2014 10:40 PM
Respected Advisor
Posts: 4,925

Re: Iterations through observations with duplicates ID

It turns out it is not more complicated to merge all consecutive periods (I added one obs to your example) :

data have;

input ID      StartYear     EndYear;

datalines;

1          1991          1992

1          1993          2005

2          1996          2013

3          1995          1998

3          1998          2002

3          2002          2004

3          2005          2007

;

data want;

do until(last.ID);

  set have; by ID;

  if startYear > _EndYear then do;

       if not first.ID then output;

       _StartYear = StartYear;

       end;

  _EndYear = EndYear;

  end;

output;

drop StartYear EndYear;

rename _StartYear=StartYear _EndYear=EndYear;

run;

proc print data=want noobs; var ID StartYear EndYear; run;

PG

PG

View solution in original post


All Replies
Solution
‎05-13-2014 10:40 PM
Respected Advisor
Posts: 4,925

Re: Iterations through observations with duplicates ID

It turns out it is not more complicated to merge all consecutive periods (I added one obs to your example) :

data have;

input ID      StartYear     EndYear;

datalines;

1          1991          1992

1          1993          2005

2          1996          2013

3          1995          1998

3          1998          2002

3          2002          2004

3          2005          2007

;

data want;

do until(last.ID);

  set have; by ID;

  if startYear > _EndYear then do;

       if not first.ID then output;

       _StartYear = StartYear;

       end;

  _EndYear = EndYear;

  end;

output;

drop StartYear EndYear;

rename _StartYear=StartYear _EndYear=EndYear;

run;

proc print data=want noobs; var ID StartYear EndYear; run;

PG

PG
Super User
Posts: 10,035

Re: Iterations through observations with duplicates ID

This code can remove all the consecutive date.

data have;
input ID      StartDate     EndDate;
cards;
1          1991          1992
1          1993          2005
2          1996          2013
3          1995          1998
3          1998          2002
3          2005          2207
;
run;
data temp;
 set have;
 date=StartDate;name='StartDate';output;
 date=EndDate;name='EndDate';output;
 drop StartDate     EndDate;
run;
data x;
 set temp;
 by id date notsorted;
 if first.date and last.date;
 run;
 data x;
  set x;
  if name='StartDate' then     n+1;
run;
proc transpose data=x out=want(drop=_: n);
by id n;
id name;
var date;
run;

Xia Keshan

Contributor
Posts: 23

Re: Iterations through observations with duplicates ID

Thanks very much PGStats and Ksharp !!!

Both your solutions are correct but I'm more tempting to know more about PGStats' solution because I failed to comprehend it at once. And then after a few reseachs I know that this method is called DOW-loop by Ian Whitlock.

Back to the question, let me expand my question a little bit. Assuming I have another variable named "Code" which is a number associated with the period.

data have;

input ID   Code   StartYear     EndYear;

datalines;

1     10     1991          1992

1     13     1993          2005

2       5     1996          2013

3     20     1995          1998

3     21     1998          2002

3     22     2002          2004

3     25     2005          2007

;

If I'm using PGStats' solution and by adding _Code = Code

data want;

do until(last.ID);

  set have; by ID;

  if startYear > _EndYear then do;

       if not first.ID then output;

       _StartYear = StartYear;

       end;

  _EndYear = EndYear;

  _Code = Code;

  end;

output;

run;

I get the results below :

ID     Code     StartYear     EndYear     _StartYear     _EndYear     _Code

1          13            1993           2005              1991            1992          10

1          13            1993           2005              1993            2005          13

2            5            1996           2013              1996            2013            5

3          25            2005           2007              1995            2004           22

3          25            2005           2007              2005            2007           25

If you look at the _Code variable for ID = 3, then the value 22 is from the last period of 3 which were merged. I wonder if instead of having the last code, can we store the first code (in this case 20 from period 1995 - 1998) ? Can we also keep track of how many periods and codes we deleted during the process (a new variable maybe) ?

Thanks very much !!!

Respected Advisor
Posts: 4,925

Re: Iterations through observations with duplicates ID

Here is the method for saving stuff from the first and/or last merged periods and to count the periods:

data have;

input ID      StartYear     EndYear StartStuff $ EndStuff $;

datalines;

1          1991          1992 S_91 E_92

1          1993          2005 S_93 E_05

2          1996          2013 S_96 E_13

3          1995          1998 S_95 E_98

3          1998          2002 S_98 E_02

3          2002          2004 S_02 E_04

3          2005          2007 S_05 E_07

;

data want;

do until(last.ID);

  set have; by ID;

  if startYear > _EndYear then do;

       if not first.ID then output;

       _StartYear = StartYear;

       _StartStuff = StartStuff;

       periodCount = 0;

       end;

  _EndYear = EndYear;

  _EndStuff = EndStuff;

  periodCount + 1;

  end;

output;

drop StartYear EndYear StartStuff EndStuff;

rename _StartYear=StartYear _EndYear=EndYear

  _StartStuff=StartStuff _EndStuff=EndStuff;

run;

proc print data=want noobs;

var ID StartYear EndYear StartStuff EndStuff periodCount;

run;

PG

PG
Contributor
Posts: 23

Re: Iterations through observations with duplicates ID

Thanks again PGStats !!! Your answers are really helpful.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 550 views
  • 6 likes
  • 3 in conversation