BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Khaladdin
Quartz | Level 8

Hi all, 

 

Happy New Year!

For each non-missing value in Column X, I want to fill the previous 5 and the next 1 observation with this non-missing value. 

 

In short, I have the following dataset:

 

Column X 

.

.

.

.

.

.

.

1 (non-missing)

.

.

 

I want to have the following output:

 

Column X 

.

.

1 (previous 5 missing values are filled our with non-missing)

1 (previous 5 missing values are filled our with non-missing)

1 (previous 5 missing values are filled our with non-missing)

1 (previous 5 missing values are filled our with non-missing)

1 (previous 5 missing values are filled our with non-missing)

1 (non-missing)

1 (next one missing value is filled out with non-missing)

.

 

Thanks!

 

  

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

First let's assume those ambiguous date strings are in DMY order since if they are in MDY order none of them are within 5 days of each other.

data have;
  input date :ddmmyy. X ;
  format date yymmdd10.;
cards;
01.01.2022 .
02.01.2022 .
03.01.2022 .
04.01.2022 .
05.01.2022 .
06.01.2022 .
07.01.2022 1 
08.01.2022 .
09.01.2022 .
;

Since there are a limited number of days you need to check you can just use a temporary array to solve this problem.  Use the date value as the index into the array. Since the array statement wants integer constants you can use macro code to generate the lower and upper bounds.  Just pick a range you know is at least as large as your actual date range.

data want;
  array all[%sysfunc(mdy(1,1,2022)):%sysfunc(mdy(12,31,2022))] _temporary_;
  if _n_=1 then do until(eof1);
    set have end=eof1;
    where not missing(x);
    do date=date-5 to date+1;
      all[date]=X;
    end;
  end;
  set have ;
  x=coalesce(x,all[date]);
run;

 Result

Obs          date    X

 1     2022-01-01    .
 2     2022-01-02    1
 3     2022-01-03    1
 4     2022-01-04    1
 5     2022-01-05    1
 6     2022-01-06    1
 7     2022-01-07    1
 8     2022-01-08    1
 9     2022-01-09    .

View solution in original post

12 REPLIES 12
Astounding
PROC Star

What should the result be when your rules suggest a conflict?  For example, what if column X contains:

.
.
.
1
.
.
.
2
3
.
.
Khaladdin
Quartz | Level 8
Thanks,

In this case:

1
1
1
1 (non-missing)
1
2
2
2 (non-missing)
3 (non-missing)
3
.

Also, there are only a few conflicting cases.



PaigeMiller
Diamond | Level 26

In order to write code that works, there must be no conflicting cases in the description ... or you describe clearly and unambiguously what to do when there are conflicts.

 

Also, how large is the REAL data set you are working with (number of rows, number of columns)? The solution will have to scale well to the real data set and not take forever and use up all disk space available; writing code on your 10 lines or so may or may not produce scalable code.

--
Paige Miller
Khaladdin
Quartz | Level 8
I just checked the data. There are no conflicts.
The only important issue is I need to fill out the missing values. That is it.
Khaladdin
Quartz | Level 8

I would like to provide additional details to clarify my question. My apologies for any inconvenience caused.

My dataset comprises various events, all recorded in Column X. This column contains a value when an event occurs; if no event occurs, Column X is left blank. For a more comprehensive understanding, please refer to the following data:

 

Date                    Column X

01.01.2022         .

02.01.2022        .

03.01.2022        .

04.01.2022       .

05.01.2022       .

06.01.2022       .

07.01.2022       1 

08.01.2022      .

09.01.2022      .

 

In this dataset, an event is recorded on 07.01.2022, marked as '1' in Column X. Each event is uniquely identified, such as the second event being marked as '2' on its respective date.

I am looking to create an 'event window' for each event, encompassing 5 days prior to and 1 day after the event date. Specifically, I aim to generate data output as follows:

 

Date                    Column X

01.01.2022         .

02.01.2022        1 (pre-event window)

03.01.2022        1 (pre-event window)

04.01.2022        1 (pre-event window)

05.01.2022        1 (pre-event window)

06.01.2022        1 (pre-event window)

07.01.2022        1 (event date)

08.01.2022       1 (post-event window)

09.01.2022      .

 

Please be aware that using the date column presents challenges. This is due to some events occurring on Fridays, resulting in the next available day being Monday.

 

In terms of handling conflicts: if another event occurs 3 days before a given event, we should only use observations up to the date of this preceding event. Additionally, if two events happen on consecutive dates, we will exclude the post-event window and include only the days leading up to the first event.

 

I have about 7 million rows in the data. 

 

I hope it is now clear. 

 

Thanks everyone. 

Khaladdin
Quartz | Level 8

I would like to provide additional details to clarify my question. My apologies for any inconvenience caused.

My dataset comprises various events, all recorded in Column X. This column contains a value when an event occurs; if no event occurs, Column X is left blank. For a more comprehensive understanding, please refer to the following data:

 

Date                    Column X

01.01.2022         .

02.01.2022        .

03.01.2022        .

04.01.2022       .

05.01.2022       .

06.01.2022       .

07.01.2022       1 

08.01.2022      .

09.01.2022      .

 

In this dataset, an event is recorded on 07.01.2022, marked as '1' in Column X. Each event is uniquely identified, such as the second event being marked as '2' on its respective date.

I am looking to create an 'event window' for each event, encompassing 5 days prior to and 1 day after the event date. Specifically, I aim to generate data output as follows:

 

Date                    Column X

01.01.2022         .

02.01.2022        1 (pre-event window)

03.01.2022        1 (pre-event window)

04.01.2022        1 (pre-event window)

05.01.2022        1 (pre-event window)

06.01.2022        1 (pre-event window)

07.01.2022        1 (event date)

08.01.2022       1 (post-event window)

09.01.2022      .

 

Please be aware that using the date column presents challenges. This is due to some events occurring on Fridays, resulting in the next available day being Monday.

 

In terms of handling conflicts: if another event occurs 3 days before a given event, we should only use observations up to the date of this preceding event. Additionally, if two events happen on consecutive dates, we will exclude the post-event window and include only the days leading up to the first event.

 

I hope it is now clear. 

 

Thanks everyone. 

 

sbxkoenk
SAS Super FREQ

You can do this nicely with PROC TIMEDATA.
That's because the entire time series (of the by-group) is loaded into memory. That way, you can easily navigate the time series (singular or plural) and retrieve values. If you are at time "t", you can very easily retrieve value from [t-7] or from [t+3]. Very handy. So you can look backwards and also forwards. The latter is not always easy in a data step.

For multiple time series, it's like being in a 2-dimensional matrix and being able to address each cell.

 

PROC TIMEDATA is the data step for time series !!

(you need SAS Econometrics or SAS/ETS to have this procedure available)

 

BR, Koen

mkeintz
PROC Star

So, per your revised sample data, if you have a variable DATE for the entire history of events and non-events, and the variable X, then:

 

data have; 
  date=_n_+'01sep2023'd;  format date date9.;
  input x @@;
datalines;  /*X values placed with at least 6 missings between them */
. . . . . 1 .
. . . . . . . . 2 .
. . . . . . 3 . .
run;

data want (drop=_:);
  merge have
        have (firstobs=6 keep=x rename=(x=_x_advance));

  if _x_advance^=. then do;
    _countdown=5;
    _x_replace=_x_advance;
  end;
  retain _x_replace _countdown;

  x=  coalesce(x,_x_replace,lag(x));
  if _countdown>0 then _countdown=_countdown-1;
  else if _countdown=0 then call missing(_x_replace);
run;

This works as long as you have 6 or more missing X's between events.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

First let's assume those ambiguous date strings are in DMY order since if they are in MDY order none of them are within 5 days of each other.

data have;
  input date :ddmmyy. X ;
  format date yymmdd10.;
cards;
01.01.2022 .
02.01.2022 .
03.01.2022 .
04.01.2022 .
05.01.2022 .
06.01.2022 .
07.01.2022 1 
08.01.2022 .
09.01.2022 .
;

Since there are a limited number of days you need to check you can just use a temporary array to solve this problem.  Use the date value as the index into the array. Since the array statement wants integer constants you can use macro code to generate the lower and upper bounds.  Just pick a range you know is at least as large as your actual date range.

data want;
  array all[%sysfunc(mdy(1,1,2022)):%sysfunc(mdy(12,31,2022))] _temporary_;
  if _n_=1 then do until(eof1);
    set have end=eof1;
    where not missing(x);
    do date=date-5 to date+1;
      all[date]=X;
    end;
  end;
  set have ;
  x=coalesce(x,all[date]);
run;

 Result

Obs          date    X

 1     2022-01-01    .
 2     2022-01-02    1
 3     2022-01-03    1
 4     2022-01-04    1
 5     2022-01-05    1
 6     2022-01-06    1
 7     2022-01-07    1
 8     2022-01-08    1
 9     2022-01-09    .

Ksharp
Super User
data have;
input x;
cards;
.
.
.
1
.
.
.
2
3
.
.
;
data have;
 set have;
 id+1;
run;
data temp1;
 set have;
 retain next;
 if not missing(x) then next=x;
 if missing(x) then n+1;
  else n=0;
run;
data temp1;
 set temp1;
 if n ne 1 then call missing(next);
run;
/******************/
proc sort data=have out=temp2;
by descending id;
run;
data temp2;
 set temp2;
 retain lag;
 if not missing(x) then lag=x;
 if missing(x) then n+1;
  else n=0;
run;
data temp2;
 set temp2;
 if n not in (1:5) then call missing(lag);
run;
proc sort data=temp2;
by  id;
run;
/*******************/
data want;
 merge temp1(keep=x next) temp2(keep=x lag);
want=coalesce(x,next,lag);
drop next lag;
run;
Khaladdin
Quartz | Level 8

Thank you! It is also correct!

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 2298 views
  • 5 likes
  • 8 in conversation