BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
bgosiker
Obsidian | Level 7

Hi! 

 

I am using a long dataset that is set up as shown below. I am trying to find the FIRST instance where the "value" variable is >=16 at least 4 times within a given 730-day period. These 4 occurences of value>=16 do not necessarily need to be consecutive, just occuring within 730 days of each other. I would also like to generate another variable that is the date of the 4th occurence of value>=16. The visit_date variable is in the form of days after a certain date that is 0, so they are formatted as continuous numbers. 

 

In the example below I would like it to identify for id=1 that visit_date=1700 is the 4th visit within a 730-day period where value>=16. And I'd like the visit_date of 1700 for this person to be output. For id=2 it would NOT flag anything since there are 4 instances of value>=16, but these do not occur within a 730 day window (1000-124>730). 

 

ID      visit_date     value 

1        100              5

1        220              16

1        330              22

1        700              12

1        915              5

1        1120            2

1        1200            17

1        1300            27

1        1500            30

1        1700            35

1        1922            5

2        124              17

2        450              18

2        555              2

2        740              29

2        1000            23

 

 

I have tried using an approach of first.id but haven't been able to get anything close to what I want. I'm not sure if an array of some form may be better for this! I'm working in SAS 9.4. 

 

Thanks for any help! 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Here is one way.  Note that it will output all observations where there are four or more large values within 730 days.

It is interesting because it actually calls the LAG3() function conditionally so that it only lags the visit dates for the days with high values.

data want;
 set have ;
 by id visit_date;
 row+1;
 if first.id then n=0;
 if value >= 16 then do;
   n+1;
   lag_date = lag3(visit_date);
   if n=4 then do;
     if lag_date + 730 >= visit_date then output;
     else n=3;
   end;
 end;
run;

View solution in original post

7 REPLIES 7
Reeza
Super User

Can you please post what you've tried so far?

 


@bgosiker wrote:

Hi! 

 

I am using a long dataset that is set up as shown below. I am trying to find the FIRST instance where the "value" variable is >=16 at least 4 times within a given 730-day period. These 4 occurences of value>=16 do not necessarily need to be consecutive, just occuring within 730 days of each other. I would also like to generate another variable that is the date of the 4th occurence of value>=16. The visit_date variable is in the form of days after a certain date that is 0, so they are formatted as continuous numbers. 

 

In the example below I would like it to identify for id=1 that visit_date=1700 is the 4th visit within a 730-day period where value>=16. And I'd like the visit_date of 1700 for this person to be output. For id=2 it would NOT flag anything since there are 4 instances of value>=16, but these do not occur within a 730 day window (1000-124>730). 

 

ID      visit_date     value 

1        100              5

1        220              16

1        330              22

1        700              12

1        915              5

1        1120            2

1        1200            17

1        1300            27

1        1500            30

1        1700            35

1        1922            5

2        124              17

2        450              18

2        555              2

2        740              29

2        1000            23

 

 

I have tried using an approach of first.id but haven't been able to get anything close to what I want. I'm not sure if an array of some form may be better for this! I'm working in SAS 9.4. 

 

Thanks for any help! 

 


 

bgosiker
Obsidian | Level 7

I've tried first creating an indicator for when value>=16 and then trying to do a count of that variable but haven't been able to finish the data step to identify the 4 consecutive value>=16 within the period I want. 

 

data visit1; set visit; 
     if value>=16 then value_gt=1; 
     if value<16 then value_gt=0; 
run; 

proc sort data=work.visit1; 
    by id visit_date value; 
run; 

data visit1; set visit1; 
    by id visit_date value; 
    if first.value then value_count=1
        value_count + value_gt; 
   
   

mkeintz
PROC Star

Keeping track of the cumulative count of records with value>=16 for each ID is the straightforward part.  But every time you encounter such a record, you must compare the current visit_date to the 3rd prior such visit_date, as well as seeing if N16>=4.  You can do this with the lag3(visit_date) function, as long as you only execute that function when you have a qualifying record in hand:

data have; 
input ID      visit_date     value ;
datalines;
1        100              5
1        220              16
1        330              22
1        700              12
1        915              5
1        1120            2
1        1200            17
1        1300            27
1        1500            30
1        1700            35
1        1922            5
2        124              17
2        450              18
2        555              2
2        740              29
2        1000            23
run;

data want (drop=n16);
  set have;
  by id;

  if first.id then n16=0;
  if value>=16 then do;
    n16+1;
    if n16>=4 and lag3(visit_date)>=visit_date-730 then flag=1;
    if flag=1 then n16=-1000000;
  end;
run;

 

  1. The "by id" statement, in tandem with the "set have", generates temporary dummy variables first.id and last.id indicating whether the observation in hand is the first or last one for the current ID value.  This provides a way the reset the N16 counter (N of records with value>=16) to zero when starting a new id group.

  2. Then, only when the current record is 16 or more:
    1. Increment N16
    2. if both N16>=4  (4 or more for an ID) and the 3rd prior visit date is within 730 days, set flag=1.
    3. But once flag=1 reset N16 to an impossibly negative value, so it will never again attain a value of 4 or more for the current id.
  3. The tests of N16>=4  and lag3(visit_date)>=visit_date-730, must be inside the do-group.  That is because the lag3 function is not a "look back" like in excel. It is a 3-deep queue updater where the queue contents are retained across observations.  And the queue must be updated only when you have value>=16.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

Here is one way.  Note that it will output all observations where there are four or more large values within 730 days.

It is interesting because it actually calls the LAG3() function conditionally so that it only lags the visit dates for the days with high values.

data want;
 set have ;
 by id visit_date;
 row+1;
 if first.id then n=0;
 if value >= 16 then do;
   n+1;
   lag_date = lag3(visit_date);
   if n=4 then do;
     if lag_date + 730 >= visit_date then output;
     else n=3;
   end;
 end;
run;
bgosiker
Obsidian | Level 7
This looks like it did the trick! I really appreciate it!
s_lassen
Meteorite | Level 14

 

Another possibility is to use a DoW loop (with a counter), a WHERE clause, and the DIF function instead of the LAG function:

data want;
  do _N_=1 by 1 until(last.id);
    set have;
    where value>=16;
    by id;
    if dif3(visit_date)<=730 and _N_>=4 then
      output;
    end;
run;
mkeintz
PROC Star

But this logic generates ALL qualifying events.  I believe the OP only wanted the first such event.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1978 views
  • 2 likes
  • 5 in conversation