DATA Step, Macro, Functions and more

Fill missing values with average of previous/next values

Accepted Solution Solved
Reply
Contributor
Posts: 38
Accepted Solution

Fill missing values with average of previous/next values

Hello, 

 

I'd like some help filling out missing values. 

 

I have a dataset with daily values of temperature and has some missing values. I want to fill them if they meet certain conditions.

 

If there are non-missing values within 7 days before and 7 days after, use the average of the closest previous/next day to missing as the imputed value. If the closest non-missing value is outside this 7 day window (before or after), then keep as missing. If the first or last day of the data collected is missing, keep as missing. 

 

A few caveats I can think of are: 

-filling in these values must be done BY zipcode.

-sometimes there are several consecutive missing values

-this is a big dataset, so the more efficient, the better

 

The way I am currently (not successfully) going about it is: 

get nearest before value (var1), nearest after value (var2), and average these --> but dont know how to account for the 7 day condition

 

 

Thank you!

 

 

data have; 
input date zip temp; 
datalines; 
jan1  90001 50
jan2  90001 51
jan3  90001 53
jan4  90001 . 
jan5  90001 49
jan6  90001 . 
jan7  90001 . 
jan8  90001 . 
jan9  90001 50
jan10 90001 55
;
run; 

data want; 
input input date zip temp temp_new; 
datalines; 
jan1  90001 50 50
jan2  90001 51 51
jan3  90001 53 53
jan4  90001 .  51
jan5  90001 49 49
jan6  90001 .  49.5
jan7  90001 .  49.5
jan8  90001 .  49.5
jan9  90001 50 50
jan10 90001 55 55
;
run

 

 

 


Accepted Solutions
Solution
a week ago
Super User
Posts: 10,046

Re: Fill missing values with average of previous/next values

Posted in reply to pamplemouse22

OK. That would be more simple.

 

data have; 
input date : anydtdte. zip temp; 
format date date9.;
datalines; 
2000jan1  90001 50
2000jan2  90001 51
2000jan3  90001 53
2000jan4  90001 . 
2000jan5  90001 49
2000jan6  90001 . 
2000jan7  90001 . 
2000jan8  90001 . 
2000jan9  90001 50
2000jan10 90001 55
;
run; 


data want;
 if _n_=1 then do;
  if 0 then set have;
  declare hash h(dataset:'have',hashexp:20);
  h.definekey('zip','date');
  h.definedata('temp');
  h.definedone();
 end;
set have;
if missing(temp) then do;
 do i=1 to 7;
 _date=date-i;
 call missing(temp);
 if h.find(key:zip,key:_date)=0 and not missing(temp) then do;lag=temp;leave;end;
 end;

 do i=1 to 7;
 _date=date+i;
 call missing(temp);
 if h.find(key:zip,key:_date)=0 and not missing(temp) then do;next=temp;leave;end;
 end;

 temp=(lag+next)/2;
end;
format _date date9.;
drop  lag next i _date;
run;

View solution in original post


All Replies
Super Contributor
Posts: 298

Re: Fill missing values with average of previous/next values

Posted in reply to pamplemouse22

Can you provide sample data for two zip codes each having 8 - 10 observations to check for 7-day window? Show the resulting output you need from out of that data.

Contributor
Posts: 38

Re: Fill missing values with average of previous/next values

 

here is sample data of what i have and want. 

 

data have; 
input date zip temp; 
datalines; 
01Jan2005 90001 50
02Jan2005 90001	51
03Jan2005 90001	52
04Jan2005 90001	49
05Jan2005 90001	48
06Jan2005 90001	.
07Jan2005 90001	50
08Jan2005 90001	51
09Jan2005 90001	52
10Jan2005 90001	49
11Jan2005 90001	54
12Jan2005 90001	60
13Jan2005 90001	50
14Jan2005 90001	.
15Jan2005 90001	52
16Jan2005 90001	49
17Jan2005 90001	.
18Jan2005 90001	60
19Jan2005 90001	50
20Jan2005 90001	51
01Jan2005 90002	60
02Jan2005 90002	51
03Jan2005 90002	52
04Jan2005 90002	52
05Jan2005 90002	52
06Jan2005 90002	.
07Jan2005 90002	.
08Jan2005 90002	.
09Jan2005 90002	.
10Jan2005 90002	.
11Jan2005 90002	.
12Jan2005 90002	.
13Jan2005 90002	.
14Jan2005 90002	.
15Jan2005 90002	52
16Jan2005 90002	49
17Jan2005 90002	52
18Jan2005 90002	49
19Jan2005 90002	48
20Jan2005 90002	60
;
run; 


data want; 
input date zip temp temp_new; 
datalines; 
01Jan2005 90001 50  
02Jan2005 90001	51  
03Jan2005 90001	52  
04Jan2005 90001	49  
05Jan2005 90001	48  
06Jan2005 90001	. 49
07Jan2005 90001	50  
08Jan2005 90001	51  
09Jan2005 90001	52  
10Jan2005 90001	49
11Jan2005 90001	54
12Jan2005 90001	60
13Jan2005 90001	50
14Jan2005 90001	. 51
15Jan2005 90001	52
16Jan2005 90001	49
17Jan2005 90001	. 54.5
18Jan2005 90001	60
19Jan2005 90001	50
20Jan2005 90001	51
01Jan2005 90002	60
02Jan2005 90002	51
03Jan2005 90002	52
04Jan2005 90002	52
05Jan2005 90002	52
06Jan2005 90002	. .
07Jan2005 90002	. .
08Jan2005 90002	. 52
09Jan2005 90002	. 52
10Jan2005 90002	. 52
11Jan2005 90002	. 52
12Jan2005 90002	. 52
13Jan2005 90002	. .
14Jan2005 90002	. .
15Jan2005 90002	52
16Jan2005 90002	49
17Jan2005 90002	52
18Jan2005 90002	49
19Jan2005 90002	48
20Jan2005 90002	60
;
run; 
Super Contributor
Posts: 298

Re: Fill missing values with average of previous/next values

[ Edited ]
Posted in reply to pamplemouse22

You have several ways to do your job. But you indicate that you are not getting the expected output. 

 

In your sample data set, you have not provided one condition to be checked. You want the TEMP with a missing value for the first and last date of the ZIP be retained as such. So I have modified your input data for checking the condition :

 

data have;
INFILE CARDS EXPANDTABS; 
input date date9. zip temp; 
cards; 
01Jan2005 90001   .
02Jan2005 90001   51
03Jan2005 90001   52
04Jan2005 90001   49
05Jan2005 90001   48
06Jan2005 90001   .
07Jan2005 90001   50
08Jan2005 90001   51
09Jan2005 90001   52
10Jan2005 90001   49
11Jan2005 90001   54
12Jan2005 90001   60
13Jan2005 90001   50
14Jan2005 90001   .
15Jan2005 90001   52
16Jan2005 90001   49
17Jan2005 90001   .
18Jan2005 90001   60
19Jan2005 90001   50
20Jan2005 90001    .
01Jan2005 90002   60
02Jan2005 90002   51
03Jan2005 90002   52
04Jan2005 90002   52
05Jan2005 90002   52
06Jan2005 90002   .
07Jan2005 90002   .
08Jan2005 90002   .
09Jan2005 90002   .
10Jan2005 90002   .
11Jan2005 90002   .
12Jan2005 90002   .
13Jan2005 90002   .
14Jan2005 90002   .
15Jan2005 90002   52
16Jan2005 90002   49
17Jan2005 90002   52
18Jan2005 90002   49
19Jan2005 90002   48
20Jan2005 90002   .
;
run; 

Next, we scan the DATA SET to get the SMALLEST and the BIGGEST Date to SIZE the Array to be used in the following Program. If you know these before hand, you can skip this step but you need to create Macro variables for them (bDate, eDate). The array is sized for the biggest group. So, small groups will also fit into it.

 

data _null_;
   retain eDate 0 sDate 999999999;
 
   set have end = eof;
   if date > eDate then eDate = date;
   if date < sDate then sDate = date;
   if eof then do;
      call symputx('sDate', sDate);
      call symputx('eDate', eDate);
   end;
run;

The program uses two DO-LOOPs. This do-loop is known as DoW-loop. In the first do-loop, TEMP for each DATE is saved in a _temporary_ array. Only one Zip at a time is processed. We save the first DATE and the last DATE to check your boundary conditions. In the second do-loop, the same group is scanned again and checked for your requirements. Here is the program:

 


options fullstimer;

data want; array k[&sDate:&eDate] _temporary_; do until(last.zip); set have; by zip; k[date] = temp; if first.zip then startD = date; if last.zip then endD = date; end; do until(last.zip); set have; by zip; if (date = startD & temp = .) | (date = endD & temp = .) then temp_new = .; else if not missing(temp) then temp_new = temp; else do; tf = .; td = .; do m = date+1 to date+7; * going forward; if k[m] > 0 then do; tf = k[m]; leave; end; end; do m = date - 1 to date - 7 by -1; * going backward; if k[m] > 0 then do; td = k[m]; leave; end; end; if missing(tf) or missing(td) then temp_new = .; else temp_new = (tf + td) * 0.5; end; output; end; keep date zip temp temp_new; run;

Here is the OUTPUT generated by this program:

 

Obs	date	zip	temp	temp_new
1	16437	90001	50	.
2	16438	90001	51	51.0
3	16439	90001	52	52.0
4	16440	90001	49	49.0
5	16441	90001	48	48.0
6	16442	90001	.	49.0
7	16443	90001	50	50.0
8	16444	90001	51	51.0
9	16445	90001	52	52.0
10	16446	90001	49	49.0
11	16447	90001	54	54.0
12	16448	90001	60	60.0
13	16449	90001	50	50.0
14	16450	90001	.	51.0
15	16451	90001	52	52.0
16	16452	90001	49	49.0
17	16453	90001	.	54.5
18	16454	90001	60	60.0
19	16455	90001	50	50.0
20	16456	90001	51	.
21	16437	90002	60	.
22	16438	90002	51	51.0
23	16439	90002	52	52.0
24	16440	90002	52	52.0
25	16441	90002	52	52.0
26	16442	90002	.	.
27	16443	90002	.	.
28	16444	90002	.	52.0
29	16445	90002	.	52.0
30	16446	90002	.	52.0
31	16447	90002	.	52.0
32	16448	90002	.	52.0
33	16449	90002	.	.
34	16450	90002	.	.
35	16451	90002	52	52.0
36	16452	90002	49	49.0
37	16453	90002	52	52.0
38	16454	90002	49	49.0
39	16455	90002	48	48.0
40	16456	90002	60	.

You can compare all the methods and see how FULLSTIMER gives Time and Memory. Enjoy.

PROC Star
Posts: 1,760

Re: Fill missing values with average of previous/next values

[ Edited ]
Posted in reply to pamplemouse22

Like this?

 

data HAVE; 
input DATE date7. ZIP TEMP; 
datalines; 
01jan17 90001 50
02jan17 90001 51
03jan17 90001 53
04jan17 90001 . 
05jan17 90001 49
06jan17 90001 . 
07jan17 90001 . 
08jan17 90001 . 
09jan17 90001 50
10jan17 90001 55
10feb17 90001 .
11feb17 90001 55
10mar17 90001 .
20mar17 90001 55
run; 
data WANT;                           
  array ZIPS [%sysevalf('01jan2010'd):%sysevalf('01may2019'd)] _temporary_;
  array TEMPS[%sysevalf('01jan2010'd):%sysevalf('01may2019'd)] _temporary_;
  if _N_=1 then do until(LASTOBS);
    set HAVE end=LASTOBS;
    ZIPS [DATE]=ZIP ;
    TEMPS[DATE]=TEMP;
  end;
  set HAVE;
  by ZIP;
  if TEMP=. then do;
    do I=DATE-1 to DATE-7 by -1;
      if ZIPS[I] ne ZIP then leave;
      if TEMPS[I] then LAST_TEMP=TEMPS[I];
      if LAST_TEMP then leave;
    end;
    do I=DATE+1 to DATE+7;
      if ZIPS[I] ne ZIP then leave;
      if TEMPS[I] then NEXT_TEMP=TEMPS[I];
      if NEXT_TEMP then leave;
    end;
    if ^first.ZIP and ^last.ZIP then TEMP=mean(LAST_TEMP,NEXT_TEMP);
  end;
run;

 

DATE ZIP TEMP
20820 90001 50.0
20821 90001 51.0
20822 90001 53.0
20823 90001 51.0
20824 90001 49.0
20825 90001 49.5
20826 90001 49.5
20827 90001 49.5
20828 90001 50.0
20829 90001 55.0
20860 90001 55.0
20861 90001 55.0
20888 90001 .
20898 90001 55.0

It should be fast as it just performs 2 sequential reads of the table.

Contributor
Posts: 38

Re: Fill missing values with average of previous/next values

[ Edited ]

hello, 

 

thank you for this code! i tried the following and am getting an error due to array being out of range. i think it might be because of missing values either at the start or end date. below is the code and error.

 

 

445  data WANT;
446    array ZIPS [%sysevalf('01jan2005'd):%sysevalf('31dec2010'd)] _temporary_;
447    array TEMPS[%sysevalf('01jan2005'd):%sysevalf('31dec2010'd)] _temporary_;
448    if _N_=1 then do until(LASTOBS);
449      set test end=LASTOBS;
450      ZIPS [DATE]=zipnum ;
451      TEMPS[DATE]=meantemp;
452    end;
453    set test;
454    by zipnum;
455    if meantemp=. then do;
456      do I=DATE-1 to DATE-7 by -1;
457        if ZIPS[I] ne zipnum then leave;
458        if TEMPS[I] then LAST_TEMP=TEMPS[I];
459        if LAST_TEMP then leave;
460      end;
461      do I=DATE+1 to DATE+7;
462        if ZIPS[I] ne zipnum then leave;
463        if TEMPS[I] then NEXT_TEMP=TEMPS[I];
464        if NEXT_TEMP then leave;
465      end;
466      if ^first.zipnum and ^last.zipnum then meantemp=mean(LAST_TEMP,NEXT_TEMP);
467    end;
468  run;

ERROR: Array subscript out of range at line 462 column 10.
LASTOBS=1 date=31DEC2010 zipnum=90004 meantemp=. FIRST.zipnum=0 LAST.zipnum=1 I=18628 LAST_TEMP=.
NEXT_TEMP=. _ERROR_=1 _N_=8764
NOTE: Missing values were generated as a result of performing an operation on missing values.
      Each place is given by: (Number of times) at (Line):(Column).
      380 at 466:53
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 3615150 observations read from the data set WORK.TEST.
NOTE: There were 8765 observations read from the data set WORK.TEST.
WARNING: The data set WORK.WANT may be incomplete.  When this step was stopped there were 8763
         observations and 6 variables.
NOTE: DATA statement used (Total process time):
      real time           0.19 seconds
      cpu time            0.15 seconds

 

PROC Star
Posts: 1,760

Re: Fill missing values with average of previous/next values

Posted in reply to pamplemouse22

you look 7 days  before jan2005 so your array must start before too.

PROC Star
Posts: 1,760

Re: Fill missing values with average of previous/next values

I gave you a solution that works.

What do you change everything if you don't know array syntax?

If you must, start from the working solution, and change one thing at a time ("oh i'll try to start exactly on 01jan2005", or "oh i'll try curly brackets").

You are giving yourself (and us) more work than needed.

Contributor
Posts: 38

Re: Fill missing values with average of previous/next values

okay, i will try that. thank you for your help.
Trusted Advisor
Posts: 1,022

Re: Fill missing values with average of previous/next values

[ Edited ]
Posted in reply to pamplemouse22

 

Using a SET statement with dataset HAVE specified as the argument twice, followed with a BY statement, provides a compact way to avoid the double "do until (last.zip)" approach:

 

 

data want (drop=_d);
  set have (in=firstpass)
      have (in=secondpass);
  by zip;

  array thist{%sysevalf{'01jan2016'd-7}:%sysevalf{'31dec2016'd+7} _temporary_;
  if first.zip then call missing(of thist{*});
  if firstpass then thist{date}=temp;

  if secondpass;
  if temp=. and lag(zip)^=zip and last.zip=0 then do _d=1 to 7 while(temp=.);
    temp=mean(thist{date - _d},thist{date + _d});
  end;
run;

Notes:

 

  1. Dataset is sorted by ZIP/DATE, but use "BY ZIP", not "BY ZIP DATE".
  2. For the THIST array put the  earliest and lastest expected dates as the array bounds.
  3. The "if temp=. and lag(zip)^=zip and last.zip=0" filter select all cases with TEMP=. except  the earliest and lastest date.
Contributor
Posts: 38

Re: Fill missing values with average of previous/next values

thank you! tried the following code, and get the following errors. Ideas? I don't understand the array well enough to figure out how to debug this. looking into it now. 

 

proc sort data=have; by zipnum; run; 
data want (drop=_d);
  set have (in=firstpass)
      have (in=secondpass);
  by zipnum;

  array thist{%sysevalf{'01jan2005'd-7}:%sysevalf{'31dec2010'd+7} _temporary_;
  if first.zipnum then call missing(of thist{*});
  if firstpass then thist{date}=meantemp;

  if secondpass;
  if meantemp=. and lag(zipnum)^=zip and last.zipum=0 then do _d=1 to 7 while(meantemp=.);
    meantemp=mean(thist{date - _d},thist{date + _d});
  end;
run;
181  data test2 (drop=_d);
182    set test (in=firstpass)
183        test (in=secondpass);
184    by zipnum;
185
186    array thist{%sysevalf{'01jan2005'd-7}:%sysevalf{'31dec2010'd+7} _temporary_;
                   -
                   22
                   76
ERROR: %SYSEVALF must be followed by an expression enclosed in parentheses.
ERROR: %SYSEVALF must be followed by an expression enclosed in parentheses.
ERROR 22-322: Syntax error, expecting one of the following: an integer constant, *.

ERROR 76-322: Syntax error, statement will be ignored.

187    if first.zipnum then call missing(of thist{*});
188    if firstpass then thist{date}=meantemp;
ERROR: Too many array subscripts specified for array thist.
189
190    if secondpass;
191    if meantemp=. and lag(zipnum)^=zip and last.zipum=0 then do _d=1 to 7 while(meantemp=.);
192      meantemp=mean(thist{date - _d},thist{date + _d});
ERROR: Too many array subscripts specified for array thist.
ERROR: Too many array subscripts specified for array thist.
193    end;
194  run;

NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
      192:19   192:36
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.TEST2 may be incomplete.  When this step was stopped there were 0
         observations and 4 variables.
WARNING: Data set WORK.TEST2 was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

Super User
Super User
Posts: 7,080

Re: Fill missing values with average of previous/next values

[ Edited ]
Posted in reply to pamplemouse22

Your error message

186    array thist{%sysevalf{'01jan2005'd-7}:%sysevalf{'31dec2010'd+7} _temporary_;
                   -
                   22
                   76
ERROR: %SYSEVALF must be followed by an expression enclosed in parentheses.
ERROR: %SYSEVALF must be followed by an expression enclosed in parentheses.
ERROR 22-322: Syntax error, expecting one of the following: an integer constant, *.

ERROR 76-322: Syntax error, statement will be ignored.

You cannot use curly brackets in function calls, use normal parentheses.

%sysevalf('01jan2005'd-7)

NOTE: you don't NEED  to use square [] or curly {} brackets anywhere in base SAS, you can just use normal () parentheses for everything. 

array thist( %sysevalf('01jan2005'd-7):%sysevalf('31dec2010'd+7)) _temporary_;

I find it makes typing much easier. Plus if your fonts are too small it is really hard to tell the difference. Never got in the habit of using them anyway since they caused havoc in the old days when using ASCII terminals to connect to IBM mainframes.

 

Contributor
Posts: 38

Re: Fill missing values with average of previous/next values

oh, got it, thanks! it runs without that error now .
Contributor
Posts: 38

Re: Fill missing values with average of previous/next values

second attempt. tried this. no errors, but new dataset looks same as original. 

 

513  proc sort data=test; by zipnum; run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds


514  data want (drop=_d);
515    set test (in=firstpass)
516        test (in=secondpass);
517    by zipnum;
518
519    array thist(%sysevalf('01jan2005'd-7):%sysevalf('31dec2010'd+7)) _temporary_;
520    if first.zipnum then call missing(of thist{*});
521    if firstpass then thist{date}=meantemp;
522
523    if secondpass;
524    if meantemp=. and lag(zipnum)^=zipnum and last.zipnum=0 then do _d=1 to 7 while(meantemp=.);
525      meantemp=mean(thist{date - _d},thist{date + _d});
526    end;
527  run;

NOTE: Missing values were generated as a result of performing an operation on missing values.
      Each place is given by: (Number of times) at (Line):(Column).
      4172 at 525:14
NOTE: There were 3615150 observations read from the data set WORK.TEST.
NOTE: There were 3615150 observations read from the data set WORK.TEST.
NOTE: The data set WORK.WANT has 3615150 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.74 seconds
      cpu time            0.71 seconds


 

 

Super User
Posts: 10,046

Re: Fill missing values with average of previous/next values

Posted in reply to pamplemouse22

How about this one .

 

data have; 
input date $ zip temp; 
datalines; 
jan1  90001 50
jan2  90001 51
jan3  90001 53
jan4  90001 . 
jan5  90001 49
jan6  90001 . 
jan7  90001 . 
jan8  90001 . 
jan9  90001 50
jan10 90001 55
;
run; 

data key;
 set have;
 by temp notsorted;
 k+first.temp;
run;

data want;
 if _n_=1 then do;
  if 0 then set key;
  declare hash h(dataset:'key',hashexp:20);
  h.definekey('k');
  h.definedata('temp');
  h.definedone();
 end;
set key;
if missing(temp) then do;
 call missing(temp);
 rc=h.find(key:k-1);
 lag=temp;

 call missing(temp);
 rc=h.find(key:k+1);
 next=temp;

 temp=mean(lag,next);
end;
drop rc lag next k;
run;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 21 replies
  • 237 views
  • 1 like
  • 7 in conversation