Hi all,
I have dataset with two variables (ID and DAY). Data are numerical SAS formats. I would like to put out the records for ID where:
For example:
Input is like this:
ID | Day |
1 | 26 |
1 | 38 |
1 | 107 |
2 | 73 |
2 | 105 |
2 | 156 |
3 | 48 |
3 | 89 |
3 | 116 |
3 | 127 |
4 | 13 |
4 | 88 |
4 | 154 |
5 | 32 |
5 | 98 |
5 | 106 |
6 | 12 |
6 | 79 |
6 | 109 |
7 | 27 |
7 | 39 |
7 | 132 |
7 | 169 |
Output should be like this:
ID | Day |
3 | 48 |
3 | 89 |
3 | 116 |
4 | 13 |
4 | 88 |
4 | 154 |
6 | 12 |
6 | 79 |
6 | 109 |
7 | 27 |
7 | 132 |
7 | 169 |
Is there a way to to come up the output I want?
Thank you
OK. it looks like problem is a little complicated.
data have; input ID Day; cards; 1 26 1 38 1 107 2 73 2 105 2 156 3 48 3 89 3 116 3 127 3 140 4 13 4 88 4 154 5 32 5 98 5 106 6 12 6 79 6 109 7 27 7 39 7 132 7 169 ; run; data temp; set have; by id; retain base; if first.id then base=day; else do; if (day-base) gt 15 then base=day; else delete; end; drop base; run; data want; first=1;last=1;count=0; do until(last.id); set temp; by id; if first.id and day ge 50 then first=0; if last.id and day le 100 then last=0; count+1; end; do until(last.id); set temp; by id; if first and last and count gt 2 then output; end; keep id day; run;
Xia Keshan
why wouldn't id 1 be included in your output?
Because the second condition is not exist. (2- if in any interval between consecutive days that was less than 15 days the second one should be removed). Where this condition is accepted, we will see the second day for this id (ID=1)should be delete (38-26=12, you see that it is less then 15). In finally only two valid observations will remain within this id, that this is due to the third condition is not acceptable. So, it should be omitted. zana
Is this a valid statement of your requirements.
Dear Reeza,
They are true.
The last observation should be more than 100.
If less than 15 between the day values, then the 2nd value removed (not all value delete).
zana
It's a little messy, but here's a plan. The first DO loop determines if an ID is needed, and the second outputs the needed records.
data want;
id_wanted='?';
n_records=1;
do until (last.id);
set have;
by id;
prior_day = day;
if first.id then do;
if day >= 50 then id_wanted='N';
end;
else do;
if day - prior_day > 15 then do;
n_records + 1;
last_day = day;
end;
end;
end;
if id_wanted='?' and n_records > 2 and last_day > 100 then id_wanted='Y';
do until (last.id);
set have;
by id;
if id_wanted='Y' then do;
if first.id or day - prior_day > 15 then output;
prior_day = day;
end;
end;
drop prior_day id_wanted n_records last_day;
run;
It's untested code, so it may need a little tweaking. But it's at least most of the way there. And there are simpler looking solutions, but they would require an extra DATA step to first remove records that are <= 15 days apart.
Good luck.
It seems to me that Hash() 's super maneuverability really helps to sort out these conditions. c1-c3 correspond to condition 1-3 respectively.
data have;
input ID Day;
cards;
1 26
1 38
1 107
2 73
2 105
2 156
3 48
3 89
3 116
3 127
4 13
4 88
4 154
5 32
5 98
5 106
6 12
6 79
6 109
7 27
7 39
7 132
7 169
;
data _null_;
if _n_=1 then do;
declare hash h(multidata:'y', ordered:'a');
h.definekey('id');
h.definedata('id', 'day');
h.definedone();
end;
do until (last.id);
set have end=last;
by id day;
c1=ifn(first.id and day lt 50, 1, c1);
if first.id or dif(day)>15 then do;
c2=sum(c2,1);
rc=h.add();
end;
c3=ifn(last.id and day gt 100,1, c3);
if last.id and sum(c1,c3,(c2>2)) < 3 then rc=h.remove();
end;
if last then rc=h.output(dataset:'want');
run;
Haikuo
data have; input ID Day; cards; 1 26 1 38 1 107 2 73 2 105 2 156 3 48 3 89 3 116 3 127 4 13 4 88 4 154 5 32 5 98 5 106 6 12 6 79 6 109 7 27 7 39 7 132 7 169 ; run; data temp; set have; if id eq lag(id) and dif(day) le 15 then delete; run; data want; first=1;last=1;count=0; do until(last.id); set temp; by id; if first.id and day ge 50 then first=0; if last.id and day le 100 then last=0; count+1; end; do until(last.id); set temp; by id; if first and last and count gt 2 then output; end; keep id day; run;
Xia Keshan
Dear Ksharp, so thanks for your code.
But when i add one more day for each id, result is not true.
Please remind the 2nd condition "If less than 15 between the day values, then the 2nd value removed".
For example: Please add Day 140 for ID=3.
3 48
3 89
3 116
3 127
3 140
When we subtracting 116 from the next value ("127"), the output (11) can't crossing the critical value (15) so it should be delete. Where this row (ID=3 and Day=127) removed from input file, now the last valid day (up to this point is 116) should be subtracting from the next number (refers to the new number "Day=140"). 140-116= 24. You can see that it is more then the critical value, so it should be remained in the output file.
zana
Here's a simple way, but more steps.
*remove days within 15;
data step1;
set have;
by id;
day_diff=dif(day);
if first.id then day_diff=.;
if day_diff<15 and day_diff>. then delete;
run;
*check if first is less than 50, last > 100 and more than 3;
data step2;
set step1;
by id;
retain flag1 flag2 count;
if first.id then do;
flag1=0;
flag2=0;
count=0;
end;
if first.id and day<50 then flag1=1;
if last.id and day>100 then flag2=1;
count+1;
run;
*get ids;
proc sql;
create table id as
select id
from step2
group by id
having max(flag1)=1
and max(flag2)=1
and count>2;
quit;
proc sql;
create table want as
select *
from have
where id in (select id from id);
quit;
Missed some new development on this thread. A one-step Hash is still available, a small tweak applied, note the 3 140 will be included. If your RAM is big enough to hold your final output, then Hash should be among the most efficient approaches:
data have;
input ID Day;
cards;
1 26
1 38
1 107
2 73
2 105
2 156
3 48
3 89
3 116
3 127
3 140
4 13
4 88
4 154
5 32
5 98
5 106
6 12
6 79
6 109
7 27
7 39
7 132
7 169
;
data _null_;
if _n_=1 then do;
declare hash h(multidata:'y', ordered:'a');
h.definekey('id');
h.definedata('id', 'day');
h.definedone();
end;
do until (last.id);
set have end=last;
by id day;
c1=ifn(first.id and day lt 50, 1, c1);
if first.id then _r=day;
if first.id or day-_r >15 then do;
c2=sum(c2,1);
rc=h.add();
_r=day;
end;
c3=ifn(last.id and day gt 100,1, c3);
if last.id and sum(c1,c3,(c2>2)) < 3 then rc=h.remove();
end;
if last then rc=h.output(dataset:'want');
run;
Haikuo
OK. it looks like problem is a little complicated.
data have; input ID Day; cards; 1 26 1 38 1 107 2 73 2 105 2 156 3 48 3 89 3 116 3 127 3 140 4 13 4 88 4 154 5 32 5 98 5 106 6 12 6 79 6 109 7 27 7 39 7 132 7 169 ; run; data temp; set have; by id; retain base; if first.id then base=day; else do; if (day-base) gt 15 then base=day; else delete; end; drop base; run; data want; first=1;last=1;count=0; do until(last.id); set temp; by id; if first.id and day ge 50 then first=0; if last.id and day le 100 then last=0; count+1; end; do until(last.id); set temp; by id; if first and last and count gt 2 then output; end; keep id day; run;
Xia Keshan
Hi Ksharp,
Thank you very much for your code which worked well.
Best regards
zana
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.