BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
zana
Calcite | Level 5

Hi all,

I have dataset with two variables (ID and DAY). Data are numerical SAS formats. I would like to put out the records for ID where:

  1. The first day should be <50 and the last day should be >100.
  2. All the consecutive days (for each ID) should be >15. So that, if in any interval between consecutive days that was less than 15 days the second one should be removed.
  3. In finally, i want to keep the IDs that have more than 2 valid observations (Day) within them.

For example:

Input is like this:

IDDay
126
138
1107
273
2105
2156
348
389
3116
3127
413
488
4154
532
598
5106
612
679
6109
727
739
7132
7169

Output should be like this:

IDDay
348
389
3116
413
488
4154
612
679
6109
727
7132
7169

Is there a way to to come up the output I want?

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

OK. it looks like problem is a little complicated.

data have;
input ID    Day;
cards;
1     26
1     38
1     107
2     73
2     105
2     156
3     48
3     89
3     116
3     127
3     140
4     13
4     88
4     154
5     32
5     98
5     106
6     12
6     79
6     109
7     27
7     39
7     132
7     169
;
run;
data temp;
 set have;
 by id;
 retain base;
 if first.id then base=day;
  else do;
          if (day-base) gt 15 then base=day; else delete;
        end;
drop base;
run;

data want;
 first=1;last=1;count=0;
 do until(last.id);
  set temp;
  by id;
  if first.id and day ge 50 then first=0;
  if last.id and day le 100 then last=0;
  count+1;
 end;
 do until(last.id);
  set temp;
  by id;
  if first and last and count gt 2 then output;
 end;
 keep id day;
run;

Xia Keshan

View solution in original post

12 REPLIES 12
DBailey
Lapis Lazuli | Level 10

why wouldn't id 1 be included in your output?

zana
Calcite | Level 5

Because the second condition is not exist. (2- if in any interval between consecutive days that was less than 15 days the second one should be removed). Where this condition is accepted, we will see the second day for this id (ID=1)should be delete (38-26=12, you see that it is less then 15). In finally only two valid observations will remain within this id, that this is due to the third condition is not acceptable. So, it should be omitted. zana

Reeza
Super User

Is this a valid statement of your requirements.


  1. The first value for day should be <50
  2. The last observation is the first where the day value is greater than 100
  3. If less than 15 between the day values then the value should not be considered.
  4. The ID needs to have 3 or more valid values to be included.
zana
Calcite | Level 5

Dear Reeza,

They are true.

The last observation should be more than 100.

If less than 15 between the day values, then the 2nd value removed (not all value delete).

zana

Astounding
PROC Star

It's a little messy, but here's a plan.  The first DO loop determines if an ID is needed, and the second outputs the needed records.

data want;

   id_wanted='?';

   n_records=1;

   do until (last.id);

      set have;

      by id;

      prior_day = day;

      if first.id then do;

         if day >= 50 then id_wanted='N';

      end;

      else do;

         if day - prior_day > 15 then do;

            n_records + 1;

            last_day = day;

         end;

      end;

   end;
   if id_wanted='?' and n_records > 2 and last_day > 100 then id_wanted='Y';

   do until (last.id);

      set have;

      by id;

      if id_wanted='Y' then do;

         if first.id or day - prior_day > 15 then output;

         prior_day = day;

      end;

   end;

   drop prior_day id_wanted n_records last_day;

run;

It's untested code, so it may need a little tweaking.  But it's at least most of the way there.  And there are simpler looking solutions, but they would require an extra DATA step to first remove records that are <= 15 days apart.

Good luck.

Haikuo
Onyx | Level 15

It seems to me that Hash() 's super maneuverability really helps to sort out these conditions. c1-c3 correspond to condition 1-3 respectively.

data have;

input ID    Day;

cards;

1     26

1     38

1     107

2     73

2     105

2     156

3     48

3     89

3     116

3     127

4     13

4     88

4     154

5     32

5     98

5     106

6     12

6     79

6     109

7     27

7     39

7     132

7     169

;

data _null_;

  if _n_=1 then do;

    declare hash h(multidata:'y', ordered:'a');

      h.definekey('id');

      h.definedata('id', 'day');

      h.definedone();

  end;

  do until (last.id);

  set have end=last;

    by id day;

       c1=ifn(first.id and day lt 50, 1, c1);

       if first.id or dif(day)>15 then do;

c2=sum(c2,1);

rc=h.add();

       end;

       c3=ifn(last.id and day gt 100,1, c3);

       if last.id and sum(c1,c3,(c2>2)) < 3 then rc=h.remove();

end;

if last then rc=h.output(dataset:'want');

run;

Haikuo

Ksharp
Super User
data have;
input ID    Day;
cards;
1     26
1     38
1     107
2     73
2     105
2     156
3     48
3     89
3     116
3     127
4     13
4     88
4     154
5     32
5     98
5     106
6     12
6     79
6     109
7     27
7     39
7     132
7     169
;
run;
data temp;
 set have;
 if id eq lag(id) and dif(day) le 15 then delete;
run;


data want;
 first=1;last=1;count=0;
 do until(last.id);
  set temp;
  by id;
  if first.id and day ge 50 then first=0;
  if last.id and day le 100 then last=0;
  count+1;
 end;
 do until(last.id);
  set temp;
  by id;
  if first and last and count gt 2 then output;
 end;
 keep id day;
run;

Xia Keshan

zana
Calcite | Level 5

Dear Ksharp, so thanks for your code.

But when i add one more day for each id, result is not true.

Please remind the 2nd condition "If less than 15 between the day values, then the 2nd value removed".

For example: Please add Day 140 for ID=3.

3     48

3     89

3     116

3     127

3     140

When we subtracting 116 from the next value ("127"), the output (11) can't crossing the critical value (15) so it should be delete. Where this row (ID=3 and Day=127) removed from input file, now the last valid day (up to this point is 116) should be subtracting from the next number (refers to the new number "Day=140"). 140-116= 24. You can see that it is more then the critical value, so it should be remained in the output file.

zana

Reeza
Super User

Here's a simple way, but more steps.

*remove days within 15;

data step1;

  set have;

  by id;

  day_diff=dif(day);

  if first.id then day_diff=.;

  if day_diff<15 and day_diff>. then delete;

run;

*check if first is less than 50, last > 100 and more than 3;

data step2;

  set step1;

  by id;

  retain flag1 flag2 count;

  if first.id then do;

  flag1=0;

  flag2=0;

  count=0;

  end;

  if first.id and day<50 then flag1=1;

  if last.id and day>100 then flag2=1;

  count+1;

run;

*get ids;

proc sql;

  create table id as

  select id

  from step2

  group by id

  having max(flag1)=1

  and max(flag2)=1

  and count>2;

quit;

proc sql;

  create table want as

  select *

  from have

  where id in (select id from id);

quit;


Haikuo
Onyx | Level 15

Missed some new development on this thread. A one-step Hash is still available, a small tweak applied, note the 3 140 will be included. If your RAM is big enough to hold your final output, then Hash should be among the most efficient approaches:

data have;

input ID    Day;

cards;

1     26

1     38

1     107

2     73

2     105

2     156

3     48

3     89

3     116

3     127

3     140

4     13

4     88

4     154

5     32

5     98

5     106

6     12

6     79

6     109

7     27

7     39

7     132

7     169

;

data _null_;

  if _n_=1 then do;

    declare hash h(multidata:'y', ordered:'a');

      h.definekey('id');

      h.definedata('id', 'day');

      h.definedone();

  end;

  do until (last.id);

  set have end=last;

    by id day;

       c1=ifn(first.id and day lt 50, 1, c1);

       if first.id then _r=day;

       if first.id or day-_r >15 then do;

c2=sum(c2,1);

rc=h.add();

_r=day;

       end;

       c3=ifn(last.id and day gt 100,1, c3);

       if last.id and sum(c1,c3,(c2>2)) < 3 then rc=h.remove();

end;

if last then rc=h.output(dataset:'want');

run;


Haikuo

Ksharp
Super User

OK. it looks like problem is a little complicated.

data have;
input ID    Day;
cards;
1     26
1     38
1     107
2     73
2     105
2     156
3     48
3     89
3     116
3     127
3     140
4     13
4     88
4     154
5     32
5     98
5     106
6     12
6     79
6     109
7     27
7     39
7     132
7     169
;
run;
data temp;
 set have;
 by id;
 retain base;
 if first.id then base=day;
  else do;
          if (day-base) gt 15 then base=day; else delete;
        end;
drop base;
run;

data want;
 first=1;last=1;count=0;
 do until(last.id);
  set temp;
  by id;
  if first.id and day ge 50 then first=0;
  if last.id and day le 100 then last=0;
  count+1;
 end;
 do until(last.id);
  set temp;
  by id;
  if first and last and count gt 2 then output;
 end;
 keep id day;
run;

Xia Keshan

zana
Calcite | Level 5

Hi Ksharp,

Thank you very much for your code which worked well.

Best regards

zana

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 956 views
  • 3 likes
  • 6 in conversation