Solved: Re: Filling missing values

Khaladdin · Posted 01-14-2024 05:31 AM

Hi all,

Happy New Year!

For each non-missing value in Column X, I want to fill the previous 5 and the next 1 observation with this non-missing value.

In short, I have the following dataset:

Column X

.

1 (non-missing)

.

I want to have the following output:

Column X

.

1 (previous 5 missing values are filled our with non-missing)

1 (non-missing)

1 (next one missing value is filled out with non-missing)

.

Thanks!

Tom · Posted 01-14-2024 11:45 AM

First let's assume those ambiguous date strings are in DMY order since if they are in MDY order none of them are within 5 days of each other.

data have;
  input date :ddmmyy. X ;
  format date yymmdd10.;
cards;
01.01.2022 .
02.01.2022 .
03.01.2022 .
04.01.2022 .
05.01.2022 .
06.01.2022 .
07.01.2022 1 
08.01.2022 .
09.01.2022 .
;

Since there are a limited number of days you need to check you can just use a temporary array to solve this problem. Use the date value as the index into the array. Since the array statement wants integer constants you can use macro code to generate the lower and upper bounds. Just pick a range you know is at least as large as your actual date range.

data want;
  array all[%sysfunc(mdy(1,1,2022)):%sysfunc(mdy(12,31,2022))] _temporary_;
  if _n_=1 then do until(eof1);
    set have end=eof1;
    where not missing(x);
    do date=date-5 to date+1;
      all[date]=X;
    end;
  end;
  set have ;
  x=coalesce(x,all[date]);
run;

Result

Obs          date    X

 1     2022-01-01    .
 2     2022-01-02    1
 3     2022-01-03    1
 4     2022-01-04    1
 5     2022-01-05    1
 6     2022-01-06    1
 7     2022-01-07    1
 8     2022-01-08    1
 9     2022-01-09    .

View solution in original post

Astounding · Posted 01-14-2024 07:13 AM

What should the result be when your rules suggest a conflict? For example, what if column X contains:

.
.
.
1
.
.
.
2
3
.
.

Khaladdin · Posted 01-14-2024 07:21 AM

Thanks,

In this case:

1
1
1
1 (non-missing)
1
2
2
2 (non-missing)
3 (non-missing)
3
.

Also, there are only a few conflicting cases.

PaigeMiller · Posted 01-14-2024 07:25 AM

In order to write code that works, there must be no conflicting cases in the description ... or you describe clearly and unambiguously what to do when there are conflicts.

Also, how large is the REAL data set you are working with (number of rows, number of columns)? The solution will have to scale well to the real data set and not take forever and use up all disk space available; writing code on your 10 lines or so may or may not produce scalable code.

--
Paige Miller

Khaladdin · Posted 01-14-2024 07:31 AM

I just checked the data. There are no conflicts.
The only important issue is I need to fill out the missing values. That is it.

Khaladdin · Posted 01-14-2024 08:47 AM

I would like to provide additional details to clarify my question. My apologies for any inconvenience caused.

My dataset comprises various events, all recorded in Column X. This column contains a value when an event occurs; if no event occurs, Column X is left blank. For a more comprehensive understanding, please refer to the following data:

Date Column X

01.01.2022 .

02.01.2022 .

03.01.2022 .

04.01.2022 .

05.01.2022 .

06.01.2022 .

07.01.2022 1

08.01.2022 .

09.01.2022 .

In this dataset, an event is recorded on 07.01.2022, marked as '1' in Column X. Each event is uniquely identified, such as the second event being marked as '2' on its respective date.

I am looking to create an 'event window' for each event, encompassing 5 days prior to and 1 day after the event date. Specifically, I aim to generate data output as follows:

Date Column X

01.01.2022 .

02.01.2022 1 (pre-event window)

03.01.2022 1 (pre-event window)

04.01.2022 1 (pre-event window)

05.01.2022 1 (pre-event window)

06.01.2022 1 (pre-event window)

07.01.2022 1 (event date)

08.01.2022 1 (post-event window)

09.01.2022 .

Please be aware that using the date column presents challenges. This is due to some events occurring on Fridays, resulting in the next available day being Monday.

In terms of handling conflicts: if another event occurs 3 days before a given event, we should only use observations up to the date of this preceding event. Additionally, if two events happen on consecutive dates, we will exclude the post-event window and include only the days leading up to the first event.

I have about 7 million rows in the data.

I hope it is now clear.

Thanks everyone.

Kurt_Bremser · Posted 01-14-2024 07:47 AM

What is the rule for retaining 1 once after the non-missing 1, and from then on have two "look-aheads" with value 2?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Khaladdin · Posted 01-14-2024 08:45 AM

I would like to provide additional details to clarify my question. My apologies for any inconvenience caused.

My dataset comprises various events, all recorded in Column X. This column contains a value when an event occurs; if no event occurs, Column X is left blank. For a more comprehensive understanding, please refer to the following data:

Date Column X

01.01.2022 .

02.01.2022 .

03.01.2022 .

04.01.2022 .

05.01.2022 .

06.01.2022 .

07.01.2022 1

08.01.2022 .

09.01.2022 .

In this dataset, an event is recorded on 07.01.2022, marked as '1' in Column X. Each event is uniquely identified, such as the second event being marked as '2' on its respective date.

I am looking to create an 'event window' for each event, encompassing 5 days prior to and 1 day after the event date. Specifically, I aim to generate data output as follows:

Date Column X

01.01.2022 .

02.01.2022 1 (pre-event window)

03.01.2022 1 (pre-event window)

04.01.2022 1 (pre-event window)

05.01.2022 1 (pre-event window)

06.01.2022 1 (pre-event window)

07.01.2022 1 (event date)

08.01.2022 1 (post-event window)

09.01.2022 .

Please be aware that using the date column presents challenges. This is due to some events occurring on Fridays, resulting in the next available day being Monday.

In terms of handling conflicts: if another event occurs 3 days before a given event, we should only use observations up to the date of this preceding event. Additionally, if two events happen on consecutive dates, we will exclude the post-event window and include only the days leading up to the first event.

I hope it is now clear.

Thanks everyone.

sbxkoenk · Posted 01-14-2024 09:37 AM

You can do this nicely with PROC TIMEDATA.
That's because the entire time series (of the by-group) is loaded into memory. That way, you can easily navigate the time series (singular or plural) and retrieve values. If you are at time "t", you can very easily retrieve value from [t-7] or from [t+3]. Very handy. So you can look backwards and also forwards. The latter is not always easy in a data step.

For multiple time series, it's like being in a 2-dimensional matrix and being able to address each cell.

PROC TIMEDATA is the data step for time series !!

(you need SAS Econometrics or SAS/ETS to have this procedure available)

Data Step for Timeseries: part 1, Overview
by chwell
https://communities.sas.com/t5/SAS-Communities-Library/Data-Step-for-Timeseries-part-1-Overview/ta-p...
Data Step for Timeseries: part 2, BY Group Processing
by chwell
https://communities.sas.com/t5/SAS-Communities-Library/Data-Step-for-Timeseries-part-2-BY-Group-Proc...

BR, Koen

mkeintz · Posted 01-15-2024 12:58 AM

So, per your revised sample data, if you have a variable DATE for the entire history of events and non-events, and the variable X, then:

data have; 
  date=_n_+'01sep2023'd;  format date date9.;
  input x @@;
datalines;  /*X values placed with at least 6 missings between them */
. . . . . 1 .
. . . . . . . . 2 .
. . . . . . 3 . .
run;

data want (drop=_:);
  merge have
        have (firstobs=6 keep=x rename=(x=_x_advance));

  if _x_advance^=. then do;
    _countdown=5;
    _x_replace=_x_advance;
  end;
  retain _x_replace _countdown;

  x=  coalesce(x,_x_replace,lag(x));
  if _countdown>0 then _countdown=_countdown-1;
  else if _countdown=0 then call missing(_x_replace);
run;

This works as long as you have 6 or more missing X's between events.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 01-14-2024 11:45 AM

First let's assume those ambiguous date strings are in DMY order since if they are in MDY order none of them are within 5 days of each other.

data have;
  input date :ddmmyy. X ;
  format date yymmdd10.;
cards;
01.01.2022 .
02.01.2022 .
03.01.2022 .
04.01.2022 .
05.01.2022 .
06.01.2022 .
07.01.2022 1 
08.01.2022 .
09.01.2022 .
;

Since there are a limited number of days you need to check you can just use a temporary array to solve this problem. Use the date value as the index into the array. Since the array statement wants integer constants you can use macro code to generate the lower and upper bounds. Just pick a range you know is at least as large as your actual date range.

data want;
  array all[%sysfunc(mdy(1,1,2022)):%sysfunc(mdy(12,31,2022))] _temporary_;
  if _n_=1 then do until(eof1);
    set have end=eof1;
    where not missing(x);
    do date=date-5 to date+1;
      all[date]=X;
    end;
  end;
  set have ;
  x=coalesce(x,all[date]);
run;

Result

Obs          date    X

 1     2022-01-01    .
 2     2022-01-02    1
 3     2022-01-03    1
 4     2022-01-04    1
 5     2022-01-05    1
 6     2022-01-06    1
 7     2022-01-07    1
 8     2022-01-08    1
 9     2022-01-09    .

Ksharp · Posted 01-15-2024 03:24 AM

data have;
input x;
cards;
.
.
.
1
.
.
.
2
3
.
.
;
data have;
 set have;
 id+1;
run;
data temp1;
 set have;
 retain next;
 if not missing(x) then next=x;
 if missing(x) then n+1;
  else n=0;
run;
data temp1;
 set temp1;
 if n ne 1 then call missing(next);
run;
/******************/
proc sort data=have out=temp2;
by descending id;
run;
data temp2;
 set temp2;
 retain lag;
 if not missing(x) then lag=x;
 if missing(x) then n+1;
  else n=0;
run;
data temp2;
 set temp2;
 if n not in (1:5) then call missing(lag);
run;
proc sort data=temp2;
by  id;
run;
/*******************/
data want;
 merge temp1(keep=x next) temp2(keep=x lag);
want=coalesce(x,next,lag);
drop next lag;
run;

Khaladdin · Posted 01-15-2024 04:09 AM

Thank you! It is also correct!

SAS Innovate 2025: Call for Content

Classroom Training Available!