Solved: Re: Flag break in sequence of pairs

WAL83 · Posted 07-13-2017 06:58 PM

Hello,

I am trying to figure out a way to add numbers (PAIR_SEQ column) when a patient's sequence of paired data (IntervalType=REST/ACTIVE where PAIR =1, 2, 3 etc) has a break in consecutive days between StartDate. For example, my dataset looks like the below table where PAIR_SEQ increases by 1 when days (StartDate) are non-sequential between paired records. See LINES = 11, 13, 19, 25, 31, 47 as examples. Any help is greatly appreciated.

LINE	ID	DataStartDate	IntervalType	PAIR	StartDate	PAIR_SEQ
1	1	19-Feb-13	REST	1	20-Feb-13	1
2	1	19-Feb-13	ACTIVE	1	20-Feb-13	1
3	1	19-Feb-13	REST	2	21-Feb-13	1
4	1	19-Feb-13	ACTIVE	2	21-Feb-13	1
5	1	19-Feb-13	REST	3	22-Feb-13	1
6	1	19-Feb-13	ACTIVE	3	22-Feb-13	1
7	1	19-Feb-13	REST	4	23-Feb-13	1
8	1	19-Feb-13	ACTIVE	4	23-Feb-13	1
9	1	19-Feb-13	REST	5	24-Feb-13	1
10	1	19-Feb-13	ACTIVE	5	24-Feb-13	1
11	1	26-Feb-13	REST	1	27-Feb-13	2
12	1	26-Feb-13	ACTIVE	1	27-Feb-13	2
13	1	26-Feb-13	REST	1	2-Mar-13	3
14	1	26-Feb-13	ACTIVE	1	2-Mar-13	3
15	1	26-Feb-13	REST	2	3-Mar-13	3
16	1	26-Feb-13	ACTIVE	2	3-Mar-13	3
17	1	26-Feb-13	REST	3	4-Mar-13	3
18	1	26-Feb-13	ACTIVE	3	4-Mar-13	3
19	1	11-Mar-13	REST	1	12-Mar-13	4
20	1	11-Mar-13	ACTIVE	1	12-Mar-13	4
21	1	11-Mar-13	REST	2	13-Mar-13	4
22	1	11-Mar-13	ACTIVE	2	13-Mar-13	4
23	1	11-Mar-13	REST	3	14-Mar-13	4
24	1	11-Mar-13	ACTIVE	3	14-Mar-13	4
25	1	18-Mar-13	REST	1	19-Mar-13	5
26	1	18-Mar-13	ACTIVE	1	19-Mar-13	5
27	1	18-Mar-13	REST	2	20-Mar-13	5
28	1	18-Mar-13	ACTIVE	2	20-Mar-13	5
29	1	18-Mar-13	REST	3	21-Mar-13	5
30	1	18-Mar-13	ACTIVE	3	21-Mar-13	5
31	1	18-Mar-13	REST	1	24-Mar-13	6
32	1	18-Mar-13	ACTIVE	1	24-Mar-13	6
33	1	18-Mar-13	REST	2	25-Mar-13	6
34	1	18-Mar-13	ACTIVE	2	25-Mar-13	6
35	1	18-Mar-13	REST	3	26-Mar-13	6
36	1	18-Mar-13	ACTIVE	3	26-Mar-13	6
37	1	18-Mar-13	REST	4	27-Mar-13	6
38	1	18-Mar-13	ACTIVE	4	27-Mar-13	6
39	1	18-Mar-13	REST	5	28-Mar-13	6
40	1	18-Mar-13	ACTIVE	5	28-Mar-13	6
41	1	18-Mar-13	REST	6	29-Mar-13	6
42	1	18-Mar-13	ACTIVE	6	29-Mar-13	6
43	1	18-Mar-13	REST	7	30-Mar-13	6
44	1	18-Mar-13	ACTIVE	7	30-Mar-13	6
45	1	18-Mar-13	REST	8	31-Mar-13	6
46	1	18-Mar-13	ACTIVE	8	31-Mar-13	6
47	1	16-Apr-13	REST	1	17-Apr-13	7
48	1	16-Apr-13	ACTIVE	1	17-Apr-13	7
49	1	16-Apr-13	REST	2	18-Apr-13	7
50	1	16-Apr-13	ACTIVE	2	18-Apr-13	7
51	1	16-Apr-13	REST	3	19-Apr-13	7
52	1	16-Apr-13	ACTIVE	3	19-Apr-13	7
53	1	16-Apr-13	REST	4	20-Apr-13	7
54	1	16-Apr-13	ACTIVE	4	20-Apr-13	7
55	1	16-Apr-13	REST	5	21-Apr-13	7
55	1	16-Apr-13	ACTIVE	5	21-Apr-13	7

Astounding · Posted 07-17-2017 07:19 PM

One way or another, that will take going through the data twice. SAS doesn't have a way to tell ahead of time how many observations there will be in a grouping. Here's one way:

data really_want;

merge want ( in=keep_these keep=id pair_seq within_seq where=(within_seq=6))

want;

by id pair_seq;

if keep_these;

run;

It's important to mention the data sets in the right order in the MERGE statement. And you would have to ignore the warnings about WITHIN_SEQ being part of both data sets and one value overwriting the other.

@WAL83 to address the final concern, taking just the first set of records that satisfy X=3 ...

data take_these;

set want;

by id;

where within_seq=6;

if first.id;

keep id pair_seq;

run;

data really_want;

merge want take_these (in=take);

by id pair_seq;

if take;

run;

View solution in original post

ballardw · Posted 07-13-2017 07:14 PM

Before starting are your "date" variables actually SAS date values, i.e. numeric with a SAS date format applied, or are they character?

If character we'll need to add a step to get the numeric versions for comparisons.

Since line 13 has the same date as line 12 is your rule description incomplete or should line 13 NOT have an increment?

I would also request clarification that 1) your data is organized by ID and 2) when the ID changes from 1 to 2, which is extremely likely to have break in sequence, that you want to restart the pair sequence value at 1.

WAL83 · Posted 07-13-2017 07:45 PM

Thanks for the quick reply.

Both Date variables (DataStartDate & StartDate) are both SAS DATE9. numeric.

@ line 13 DataStartDate does not change between records however StartDate does. Rules are based on StartDate not DataStartDate. My data is sorted by ID, StartDate, descending IntervalType. Sorry for not including that initially.

Yes, when ID changes from 1 to 2 PAIR_SEQ will start over at 1.

WAL83 · Posted 07-13-2017 09:57 PM

That seems to have done the trick. I added '...or gap < 0' since gap was sometimes negative although I don't think it matters based on the sort order. Thank you!

data want;

set have;

by ID START_DATE descending Interval_Type;

gap = dif(START_DATE);

if first.ID then pair_seq=1;

else if gap > 1 or gap < 0 then pair_seq + 1;

*drop gap;

run;

Astounding · Posted 07-13-2017 08:58 PM

Maybe I'm simplifying this too much, but I think this does the trick:

data want;

set have;

by id;

gap = dif(startdate);

if first.id then pair_seq=1;

else if gap > 1 then pair_seq + 1;

drop gap;

run;

WAL83 · Posted 07-14-2017 05:38 PM

Thanks again for your help.

After reviewing my data more thoroughly I found some anomalies that were not accouted for. My goal is to actually only keep records that have the first set of X number of consecutive pairs. The code you provided worked perfect but when I tried to reduce the set to where pair_seq=1, for some IDs it returned records with 1 pair or 2 consecutive pairs when those IDs had 3 consecutive pairs where pair_seq=2 or more. I provided 3 example cases below. Any help is much apprieciated.

The sort order of the data is: order by ID, STARTDATE, IntervalType desc;

Example 1, for first X=3 consecutive pairs (pair=1,1,2,2,3,3), I would only keep records in GREEN where ID = 1, 2 & 3 & drop records in RED

ID	DataStartDate	IntervalType	pair	StartDate	gap	pair_seq
1	2-Apr-14	REST	1	3-Apr-14	38	1
1	2-Apr-14	ACTIVE	1	3-Apr-14	0	1
1	2-Apr-14	REST	2	4-Apr-14	1	1
1	2-Apr-14	ACTIVE	2	4-Apr-14	0	1
1	2-Apr-14	REST	3	5-Apr-14	1	1
1	2-Apr-14	ACTIVE	3	5-Apr-14	0	1
1	2-Apr-14	REST	4	6-Apr-14	1	1
1	2-Apr-14	ACTIVE	4	6-Apr-14	0	1
1	2-Apr-14	REST	5	7-Apr-14	1	1
1	2-Apr-14	ACTIVE	5	7-Apr-14	0	1
2	26-Mar-13	REST	1	27-Mar-13	-14	1
2	26-Mar-13	ACTIVE	1	27-Mar-13	0	1
2	1-Apr-13	REST	1	2-Apr-13	6	2
2	1-Apr-13	ACTIVE	1	2-Apr-13	0	2
2	1-Apr-13	REST	2	3-Apr-13	1	2
2	1-Apr-13	ACTIVE	2	3-Apr-13	0	2
2	1-Apr-13	REST	3	4-Apr-13	1	2
2	1-Apr-13	ACTIVE	3	4-Apr-13	0	2
2	1-Apr-13	REST	4	5-Apr-13	1	2
2	1-Apr-13	ACTIVE	4	5-Apr-13	0	2
2	1-Apr-13	REST	5	6-Apr-13	1	2
2	1-Apr-13	ACTIVE	5	6-Apr-13	0	2
2	1-Apr-13	REST	6	7-Apr-13	1	2
2	1-Apr-13	ACTIVE	6	7-Apr-13	0	2
2	1-Apr-13	REST	7	8-Apr-13	1	2
2	1-Apr-13	ACTIVE	7	8-Apr-13	0	2
3	12-Sep-14	REST	1	13-Sep-14	-5	1
3	12-Sep-14	ACTIVE	1	13-Sep-14	0	1
3	12-Sep-14	REST	2	14-Sep-14	1	1
3	12-Sep-14	ACTIVE	2	14-Sep-14	0	1
3	16-Oct-14	REST	1	17-Oct-14	33	2
3	16-Oct-14	ACTIVE	1	17-Oct-14	0	2
3	16-Oct-14	REST	1	20-Oct-14	3	3
3	16-Oct-14	ACTIVE	1	20-Oct-14	0	3
3	16-Oct-14	REST	2	21-Oct-14	1	3
3	16-Oct-14	ACTIVE	2	21-Oct-14	0	3
3	16-Oct-14	REST	3	22-Oct-14	1	3
3	16-Oct-14	ACTIVE	3	22-Oct-14	0	3
3	16-Oct-14	REST	4	23-Oct-14	1	3
3	16-Oct-14	ACTIVE	4	23-Oct-14	0	3

Example 2, for first X=2 consecutive pairs (pair=1,1,2,2), I would only keep records in GREEN where ID = 1, 2 & 3 & drop records in RED

ID	DataStartDate	IntervalType	pair	StartDate	gap	pair_seq
1	2-Apr-14	REST	1	3-Apr-14	38	1
1	2-Apr-14	ACTIVE	1	3-Apr-14	0	1
1	2-Apr-14	REST	2	4-Apr-14	1	1
1	2-Apr-14	ACTIVE	2	4-Apr-14	0	1
1	2-Apr-14	REST	3	5-Apr-14	1	1
1	2-Apr-14	ACTIVE	3	5-Apr-14	0	1
1	2-Apr-14	REST	4	6-Apr-14	1	1
1	2-Apr-14	ACTIVE	4	6-Apr-14	0	1
1	2-Apr-14	REST	5	7-Apr-14	1	1
1	2-Apr-14	ACTIVE	5	7-Apr-14	0	1
2	26-Mar-13	REST	1	27-Mar-13	-14	1
2	26-Mar-13	ACTIVE	1	27-Mar-13	0	1
2	1-Apr-13	REST	1	2-Apr-13	6	2
2	1-Apr-13	ACTIVE	1	2-Apr-13	0	2
2	1-Apr-13	REST	2	3-Apr-13	1	2
2	1-Apr-13	ACTIVE	2	3-Apr-13	0	2
2	1-Apr-13	REST	3	4-Apr-13	1	2
2	1-Apr-13	ACTIVE	3	4-Apr-13	0	2
2	1-Apr-13	REST	4	5-Apr-13	1	2
2	1-Apr-13	ACTIVE	4	5-Apr-13	0	2
2	1-Apr-13	REST	5	6-Apr-13	1	2
2	1-Apr-13	ACTIVE	5	6-Apr-13	0	2
2	1-Apr-13	REST	6	7-Apr-13	1	2
2	1-Apr-13	ACTIVE	6	7-Apr-13	0	2
2	1-Apr-13	REST	7	8-Apr-13	1	2
2	1-Apr-13	ACTIVE	7	8-Apr-13	0	2
3	12-Sep-14	REST	1	13-Sep-14	-5	1
3	12-Sep-14	ACTIVE	1	13-Sep-14	0	1
3	12-Sep-14	REST	2	14-Sep-14	1	1
3	12-Sep-14	ACTIVE	2	14-Sep-14	0	1
3	16-Oct-14	REST	1	17-Oct-14	33	2
3	16-Oct-14	ACTIVE	1	17-Oct-14	0	2
3	16-Oct-14	REST	1	20-Oct-14	3	3
3	16-Oct-14	ACTIVE	1	20-Oct-14	0	3
3	16-Oct-14	REST	2	21-Oct-14	1	3
3	16-Oct-14	ACTIVE	2	21-Oct-14	0	3
3	16-Oct-14	REST	3	22-Oct-14	1	3
3	16-Oct-14	ACTIVE	3	22-Oct-14	0	3
3	16-Oct-14	REST	4	23-Oct-14	1	3
3	16-Oct-14	ACTIVE	4	23-Oct-14	0	3

Example 3, for first X=4 consecutive pairs (pair=1,1,2,2,3,3,4,4), I would only keep records in GREEN where ID = 1, 2 & 3 & drop records in RED

ID	DataStartDate	IntervalType	pair	StartDate	gap	pair_seq
1	2-Apr-14	REST	1	3-Apr-14	38	1
1	2-Apr-14	ACTIVE	1	3-Apr-14	0	1
1	2-Apr-14	REST	2	4-Apr-14	1	1
1	2-Apr-14	ACTIVE	2	4-Apr-14	0	1
1	2-Apr-14	REST	3	5-Apr-14	1	1
1	2-Apr-14	ACTIVE	3	5-Apr-14	0	1
1	2-Apr-14	REST	4	6-Apr-14	1	1
1	2-Apr-14	ACTIVE	4	6-Apr-14	0	1
1	2-Apr-14	REST	5	7-Apr-14	1	1
1	2-Apr-14	ACTIVE	5	7-Apr-14	0	1
2	26-Mar-13	REST	1	27-Mar-13	-14	1
2	26-Mar-13	ACTIVE	1	27-Mar-13	0	1
2	1-Apr-13	REST	1	2-Apr-13	6	2
2	1-Apr-13	ACTIVE	1	2-Apr-13	0	2
2	1-Apr-13	REST	2	3-Apr-13	1	2
2	1-Apr-13	ACTIVE	2	3-Apr-13	0	2
2	1-Apr-13	REST	3	4-Apr-13	1	2
2	1-Apr-13	ACTIVE	3	4-Apr-13	0	2
2	1-Apr-13	REST	4	5-Apr-13	1	2
2	1-Apr-13	ACTIVE	4	5-Apr-13	0	2
2	1-Apr-13	REST	5	6-Apr-13	1	2
2	1-Apr-13	ACTIVE	5	6-Apr-13	0	2
2	1-Apr-13	REST	6	7-Apr-13	1	2
2	1-Apr-13	ACTIVE	6	7-Apr-13	0	2
2	1-Apr-13	REST	7	8-Apr-13	1	2
2	1-Apr-13	ACTIVE	7	8-Apr-13	0	2
3	12-Sep-14	REST	1	13-Sep-14	-5	1
3	12-Sep-14	ACTIVE	1	13-Sep-14	0	1
3	12-Sep-14	REST	2	14-Sep-14	1	1
3	12-Sep-14	ACTIVE	2	14-Sep-14	0	1
3	16-Oct-14	REST	1	17-Oct-14	33	2
3	16-Oct-14	ACTIVE	1	17-Oct-14	0	2
3	16-Oct-14	REST	1	20-Oct-14	3	3
3	16-Oct-14	ACTIVE	1	20-Oct-14	0	3
3	16-Oct-14	REST	2	21-Oct-14	1	3
3	16-Oct-14	ACTIVE	2	21-Oct-14	0	3
3	16-Oct-14	REST	3	22-Oct-14	1	3
3	16-Oct-14	ACTIVE	3	22-Oct-14	0	3
3	16-Oct-14	REST	4	23-Oct-14	1	3
3	16-Oct-14	ACTIVE	4	23-Oct-14	0	3

Astounding · Posted 07-14-2017 10:38 PM

It gets a bit more complex ... you need to set up an additional counter. This is untested but looks right:

data want;

retain x 3; /* or 4 or whatever you choose */

set have;

by ID START_DATE descending Interval_Type;

gap = dif(START_DATE);

if first.ID then do;

pair_seq=1;

within_seq=1;

end;

else if gap > 1 or gap < 0 then do;

pair_seq + 1;

within_seq=1;

end;

else within_seq + 1;

if within_seq > 2 * x then delete;

*drop gap;

run;

WAL83 · Posted 07-17-2017 07:02 PM

Thank you. Your code is counting within_seq correctly and

if within_seq > 2 * x then delete;

removes records where within_seq > 6 when X=3, but pairs with less than X=3 are still present. How can I delete these? For example, the first sequence of X=3 pairs is pair_seq=3 for ID=3. I want to delete those records in red.

x	ID	DataStartDate	IntervalType	pair	StartDate	gap	pair_seq	within_seq
3	3	12-Sep-14	REST	1	13-Sep-14	-5	1	1
3	3	12-Sep-14	ACTIVE	1	13-Sep-14	0	1	2
3	3	12-Sep-14	REST	2	14-Sep-14	1	1	3
3	3	12-Sep-14	ACTIVE	2	14-Sep-14	0	1	4
3	3	16-Oct-14	REST	1	17-Oct-14	33	2	1
3	3	16-Oct-14	ACTIVE	1	17-Oct-14	0	2	2
3	3	16-Oct-14	REST	1	20-Oct-14	3	3	1
3	3	16-Oct-14	ACTIVE	1	20-Oct-14	0	3	2
3	3	16-Oct-14	REST	2	21-Oct-14	1	3	3
3	3	16-Oct-14	ACTIVE	2	21-Oct-14	0	3	4
3	3	16-Oct-14	REST	3	22-Oct-14	1	3	5
3	3	16-Oct-14	ACTIVE	3	22-Oct-14	0	3	6

Astounding · Posted 07-17-2017 07:19 PM

One way or another, that will take going through the data twice. SAS doesn't have a way to tell ahead of time how many observations there will be in a grouping. Here's one way:

data really_want;

merge want ( in=keep_these keep=id pair_seq within_seq where=(within_seq=6))

want;

by id pair_seq;

if keep_these;

run;

It's important to mention the data sets in the right order in the MERGE statement. And you would have to ignore the warnings about WITHIN_SEQ being part of both data sets and one value overwriting the other.

@WAL83 to address the final concern, taking just the first set of records that satisfy X=3 ...

data take_these;

set want;

by id;

where within_seq=6;

if first.id;

keep id pair_seq;

run;

data really_want;

merge want take_these (in=take);

by id pair_seq;

if take;

run;

WAL83 · Posted 07-17-2017 08:25 PM

Thank you.

The data step removes records with within_seq < 6 when x=3 but I also need to only keep the first pair_seq where X=3. Some IDs have more than 1 set of within_seq=1,2,3,4,5,6 when X=3. If I filter the dataset on pair_seq=1 I will miss records where their first within_seq=1,2,3,4,5,6 is greater than pair_seq=1 (ID=3).

For example, ID=5 has 3 pair_seq=1,2,3 where within_seq=1,2,3,4,5,6. I only need to keep the first pair_seq (pair_seq=1). I think this is the last issue to account for to achieve my goal.

ID	pair_seq	within_seq	x	Data Start Date	Interval Type	pair	Start Date	gap
3	3	1	3	16-Oct-14	REST	1	20-Oct-14	3
3	3	2	3	16-Oct-14	ACTIVE	1	20-Oct-14	0
3	3	3	3	16-Oct-14	REST	2	21-Oct-14	1
3	3	4	3	16-Oct-14	ACTIVE	2	21-Oct-14	0
3	3	5	3	16-Oct-14	REST	3	22-Oct-14	1
3	3	6	3	16-Oct-14	ACTIVE	3	22-Oct-14	0
5	1	1	3	22-May-13	REST	1	23-May-13	-42
5	1	2	3	22-May-13	ACTIVE	1	23-May-13	0
5	1	3	3	22-May-13	REST	2	24-May-13	1
5	1	4	3	22-May-13	ACTIVE	2	24-May-13	0
5	1	5	3	22-May-13	REST	3	25-May-13	1
5	1	6	3	22-May-13	ACTIVE	3	25-May-13	0
5	2	1	3	15-Jul-13	REST	1	16-Jul-13	42
5	2	2	3	15-Jul-13	ACTIVE	1	16-Jul-13	0
5	2	3	3	15-Jul-13	REST	2	17-Jul-13	1
5	2	4	3	15-Jul-13	ACTIVE	2	17-Jul-13	0
5	2	5	3	15-Jul-13	REST	3	18-Jul-13	1
5	2	6	3	15-Jul-13	ACTIVE	3	18-Jul-13	0
5	3	1	3	15-Jul-13	REST	1	29-Jul-13	6
5	3	2	3	15-Jul-13	ACTIVE	1	29-Jul-13	0
5	3	3	3	15-Jul-13	REST	2	30-Jul-13	1
5	3	4	3	15-Jul-13	ACTIVE	2	30-Jul-13	0
5	3	5	3	15-Jul-13	REST	3	31-Jul-13	1
5	3	6	3	15-Jul-13	ACTIVE	3	31-Jul-13	0

WAL83 · Posted 07-18-2017 01:14 PM

That works perfect! Thank you!!!

Classroom Training Available!