## Restrictions on the choice of observations

Solved
Frequent Contributor
Posts: 81

# Restrictions on the choice of observations

Hi all,

I have dataset with two variables (ID and DAY). Data are numerical SAS formats. I would like to put out the records for ID where:

1. The first day should be <50 and the last day should be >100.
2. All the consecutive days (for each ID) should be >15. So that, if in any interval between consecutive days that was less than 15 days the second one should be removed.
3. In finally, i want to keep the IDs that have more than 2 valid observations (Day) within them.

For example:

Input is like this:

 ID Day 1 26 1 38 1 107 2 73 2 105 2 156 3 48 3 89 3 116 3 127 4 13 4 88 4 154 5 32 5 98 5 106 6 12 6 79 6 109 7 27 7 39 7 132 7 169

Output should be like this:

 ID Day 3 48 3 89 3 116 4 13 4 88 4 154 6 12 6 79 6 109 7 27 7 132 7 169

Is there a way to to come up the output I want?

Thank you

Accepted Solutions
Solution
‎04-26-2014 11:37 PM
Super User
Posts: 10,787

## Re: Restrictions on the choice of observations

OK. it looks like problem is a little complicated.

```data have;
input ID    Day;
cards;
1     26
1     38
1     107
2     73
2     105
2     156
3     48
3     89
3     116
3     127
3     140
4     13
4     88
4     154
5     32
5     98
5     106
6     12
6     79
6     109
7     27
7     39
7     132
7     169
;
run;
data temp;
set have;
by id;
retain base;
if first.id then base=day;
else do;
if (day-base) gt 15 then base=day; else delete;
end;
drop base;
run;

data want;
first=1;last=1;count=0;
do until(last.id);
set temp;
by id;
if first.id and day ge 50 then first=0;
if last.id and day le 100 then last=0;
count+1;
end;
do until(last.id);
set temp;
by id;
if first and last and count gt 2 then output;
end;
keep id day;
run;

```

Xia Keshan

All Replies
Super Contributor
Posts: 578

## Re: Restrictions on the choice of observations

why wouldn't id 1 be included in your output?

Frequent Contributor
Posts: 81

## Re: Restrictions on the choice of observations

Because the second condition is not exist. (2- if in any interval between consecutive days that was less than 15 days the second one should be removed). Where this condition is accepted, we will see the second day for this id (ID=1)should be delete (38-26=12, you see that it is less then 15). In finally only two valid observations will remain within this id, that this is due to the third condition is not acceptable. So, it should be omitted. zana

Super User
Posts: 23,776

## Re: Restrictions on the choice of observations

Is this a valid statement of your requirements.

1. The first value for day should be <50
2. The last observation is the first where the day value is greater than 100
3. If less than 15 between the day values then the value should not be considered.
4. The ID needs to have 3 or more valid values to be included.
Frequent Contributor
Posts: 81

## Re: Restrictions on the choice of observations

Dear Reeza,

They are true.

The last observation should be more than 100.

If less than 15 between the day values, then the 2nd value removed (not all value delete).

zana

Super User
Posts: 6,785

## Re: Restrictions on the choice of observations

It's a little messy, but here's a plan.  The first DO loop determines if an ID is needed, and the second outputs the needed records.

data want;

id_wanted='?';

n_records=1;

do until (last.id);

set have;

by id;

prior_day = day;

if first.id then do;

if day >= 50 then id_wanted='N';

end;

else do;

if day - prior_day > 15 then do;

n_records + 1;

last_day = day;

end;

end;

end;
if id_wanted='?' and n_records > 2 and last_day > 100 then id_wanted='Y';

do until (last.id);

set have;

by id;

if id_wanted='Y' then do;

if first.id or day - prior_day > 15 then output;

prior_day = day;

end;

end;

drop prior_day id_wanted n_records last_day;

run;

It's untested code, so it may need a little tweaking.  But it's at least most of the way there.  And there are simpler looking solutions, but they would require an extra DATA step to first remove records that are <= 15 days apart.

Good luck.

Posts: 3,167

## Re: Restrictions on the choice of observations

It seems to me that Hash() 's super maneuverability really helps to sort out these conditions. c1-c3 correspond to condition 1-3 respectively.

data have;

input ID    Day;

cards;

1     26

1     38

1     107

2     73

2     105

2     156

3     48

3     89

3     116

3     127

4     13

4     88

4     154

5     32

5     98

5     106

6     12

6     79

6     109

7     27

7     39

7     132

7     169

;

data _null_;

if _n_=1 then do;

declare hash h(multidata:'y', ordered:'a');

h.definekey('id');

h.definedata('id', 'day');

h.definedone();

end;

do until (last.id);

set have end=last;

by id day;

c1=ifn(first.id and day lt 50, 1, c1);

if first.id or dif(day)>15 then do;

c2=sum(c2,1);

end;

c3=ifn(last.id and day gt 100,1, c3);

if last.id and sum(c1,c3,(c2>2)) < 3 then rc=h.remove();

end;

if last then rc=h.output(dataset:'want');

run;

Haikuo

Super User
Posts: 10,787

## Re: Restrictions on the choice of observations

```data have;
input ID    Day;
cards;
1     26
1     38
1     107
2     73
2     105
2     156
3     48
3     89
3     116
3     127
4     13
4     88
4     154
5     32
5     98
5     106
6     12
6     79
6     109
7     27
7     39
7     132
7     169
;
run;
data temp;
set have;
if id eq lag(id) and dif(day) le 15 then delete;
run;

data want;
first=1;last=1;count=0;
do until(last.id);
set temp;
by id;
if first.id and day ge 50 then first=0;
if last.id and day le 100 then last=0;
count+1;
end;
do until(last.id);
set temp;
by id;
if first and last and count gt 2 then output;
end;
keep id day;
run;

```

Xia Keshan

Frequent Contributor
Posts: 81

## Re: Restrictions on the choice of observations

Dear Ksharp, so thanks for your code.

But when i add one more day for each id, result is not true.

Please remind the 2nd condition "If less than 15 between the day values, then the 2nd value removed".

3     48

3     89

3     116

3     127

3     140

When we subtracting 116 from the next value ("127"), the output (11) can't crossing the critical value (15) so it should be delete. Where this row (ID=3 and Day=127) removed from input file, now the last valid day (up to this point is 116) should be subtracting from the next number (refers to the new number "Day=140"). 140-116= 24. You can see that it is more then the critical value, so it should be remained in the output file.

zana

Super User
Posts: 23,776

## Re: Restrictions on the choice of observations

Here's a simple way, but more steps.

*remove days within 15;

data step1;

set have;

by id;

day_diff=dif(day);

if first.id then day_diff=.;

if day_diff<15 and day_diff>. then delete;

run;

*check if first is less than 50, last > 100 and more than 3;

data step2;

set step1;

by id;

retain flag1 flag2 count;

if first.id then do;

flag1=0;

flag2=0;

count=0;

end;

if first.id and day<50 then flag1=1;

if last.id and day>100 then flag2=1;

count+1;

run;

*get ids;

proc sql;

create table id as

select id

from step2

group by id

having max(flag1)=1

and max(flag2)=1

and count>2;

quit;

proc sql;

create table want as

select *

from have

where id in (select id from id);

quit;

Posts: 3,167

## Re: Restrictions on the choice of observations

Missed some new development on this thread. A one-step Hash is still available, a small tweak applied, note the 3 140 will be included. If your RAM is big enough to hold your final output, then Hash should be among the most efficient approaches:

data have;

input ID    Day;

cards;

1     26

1     38

1     107

2     73

2     105

2     156

3     48

3     89

3     116

3     127

3     140

4     13

4     88

4     154

5     32

5     98

5     106

6     12

6     79

6     109

7     27

7     39

7     132

7     169

;

data _null_;

if _n_=1 then do;

declare hash h(multidata:'y', ordered:'a');

h.definekey('id');

h.definedata('id', 'day');

h.definedone();

end;

do until (last.id);

set have end=last;

by id day;

c1=ifn(first.id and day lt 50, 1, c1);

if first.id then _r=day;

if first.id or day-_r >15 then do;

c2=sum(c2,1);

_r=day;

end;

c3=ifn(last.id and day gt 100,1, c3);

if last.id and sum(c1,c3,(c2>2)) < 3 then rc=h.remove();

end;

if last then rc=h.output(dataset:'want');

run;

Haikuo

Solution
‎04-26-2014 11:37 PM
Super User
Posts: 10,787

## Re: Restrictions on the choice of observations

OK. it looks like problem is a little complicated.

```data have;
input ID    Day;
cards;
1     26
1     38
1     107
2     73
2     105
2     156
3     48
3     89
3     116
3     127
3     140
4     13
4     88
4     154
5     32
5     98
5     106
6     12
6     79
6     109
7     27
7     39
7     132
7     169
;
run;
data temp;
set have;
by id;
retain base;
if first.id then base=day;
else do;
if (day-base) gt 15 then base=day; else delete;
end;
drop base;
run;

data want;
first=1;last=1;count=0;
do until(last.id);
set temp;
by id;
if first.id and day ge 50 then first=0;
if last.id and day le 100 then last=0;
count+1;
end;
do until(last.id);
set temp;
by id;
if first and last and count gt 2 then output;
end;
keep id day;
run;

```

Xia Keshan

Frequent Contributor
Posts: 81

## Re: Restrictions on the choice of observations

Hi Ksharp,

Thank you very much for your code which worked well.

Best regards

zana

🔒 This topic is solved and locked.