Solved: Re: splitting dataset into multiple datasets of only contiguous data

AllSoEasy · Posted 03-12-2014 04:41 PM

I'm wondering how, if I have a dataset of data that is in two contiguous sets of dates, i.e.:

date

1/2004

2/2004

3/2004

4/2004

5/2004

6/2004

7/2004

8/2004

9/2004

10/2004

11/2004

12/2004

1/2008

2/2008

3/2008

4/2008

5/2008

6/2008

7/2008

8/2008

9/2008

10/2008

11/2008

12/2008

And then would want separate datasets, one starting at 1/2004 up to the first discontiguos date (i.e. in the above simple example, until 1/2008) and then a dataset from 1/2008 onward. I'm having some trouble coding an algorithm for this as it must be very general, the starting date for each set of contiguous dates could be anything, there could be any amount of data from 100 months of data to 12, and each set could be any number of months apart, I just want to find the distinction as soon as their are two consecutive dates that are more than a month apart.

I deeply appreciate any suggestions.

Reeza · Posted 03-13-2014 12:01 PM

You need to initialize the variables for each "group" then, ie if first.group then set the lag date to missing and the count back to 1.

Everything else should stay the same:

data want;

set have;

by group date;

retain continuous_group 1;

prev_date=lag(date);

if first.group then do;

prev_date=.;

continuous_group=1;

end;

format prev_date date9.;

diff=intck('month', prev_date, date);

if intck('month', prev_date, date)>1 then continuous_group+1;

run;

View solution in original post

Reeza · Posted 03-12-2014 06:00 PM

Use a lag function to check the current date against previous date and create a grouping variable.

Are you separating into SAS datasets or a text file?

Haikuo · Posted 03-12-2014 11:25 PM

Like Reeza mentioned, retain or lag is the key to your problem. Here is a possible approach using lag() + Hash, giving you a dynamic output;

data have;

input date ANYDTDTE7.;

format date date9.;

cards;

1/2004

2/2004

3/2004

4/2004

5/2004

6/2004

7/2004

8/2004

9/2004

10/2004

11/2004

12/2004

1/2008

2/2008

3/2008

4/2008

5/2008

6/2008

7/2008

8/2008

9/2008

10/2008

11/2008

12/2008

10/2009

11/2009

12/2009

;

data _null_;

if _n_=1 then do;

declare hash h(ordered:'a');

h.definekey('date');

h.definedata('date');

h.definedone();

end;

set have end=last;

if intck('month',lag(date),date) >1 then do;

n+1;

rc=h.output(dataset:cats('split',n));

rc=h.clear();

end;

if last then do; rc=h.replace();rc=h.output(dataset:cats('split',n+1)); end;

rc=h.replace();

run;

Please note, if you don't want the group of single obs, you can easily tweak the code by adding a simple condition.

Haikuo

AllSoEasy · Posted 03-13-2014 11:48 AM

Reeza & Hai.kuo - thank you both very much for your suggestions. However there is one complication that I forgot to mention and was not clear at all in the very simplistic example I gave -- my data has a grouping variable, i.e. is essentially longitudinal data -- so the sets of dates are repeated accross multiple groups, i.e.

Group Date

---------------------------

group1 1/2004

group1 2/2004

...

group1 12/2008

group2 1/2004

group2 2/2004

...

group2 12/2008

...etc

And so that's why I believe Hai.kuo's code didn't full work as I got many different datasets with various date ranges.

- Also, Reeza, I am trying to work with lag, I understand how to determine when I first encounter the dis-contiguous month with lag(), but I don't know how to set the flag consistently across all dates after just the first dis-contiguous month, across all grouping scenarios.Also I am writing to temporary SAS datasets as these datasets will immediately be used for further analysis in the SAS script, and are no longer needed when the SAS script finishes executing.

Reeza · Posted 03-13-2014 12:01 PM

You need to initialize the variables for each "group" then, ie if first.group then set the lag date to missing and the count back to 1.

Everything else should stay the same:

data want;

set have;

by group date;

retain continuous_group 1;

prev_date=lag(date);

if first.group then do;

prev_date=.;

continuous_group=1;

end;

format prev_date date9.;

diff=intck('month', prev_date, date);

if intck('month', prev_date, date)>1 then continuous_group+1;

run;

Tom · Posted 03-13-2014 12:05 PM

Assuming you have actual dates instead of just strings that look like mm/yyyy.

Assign a group variable than changes when there is a gap in dates. Restart when a new id group is started.

data want ;

set have ;

by id date ;

lag_date = lag(date);

if first.id then group=1 ;

else if intnx('month',lag_date,1) ^= intnx('month',date,0) then group+1;

run;

You probably do NOT need to make three datasets (how do you know it will be three and not 100?). Just use the new GROUP variable in you later code.

proc means data=want ;

where group=1;

run;

AllSoEasy · Posted 03-13-2014 12:22 PM

Thank you both! both answers work perfectly. I understand now, thank you so much!!

Classroom Training Available!