Hello,
I'd like some help filling out missing values.
I have a dataset with daily values of temperature and has some missing values. I want to fill them if they meet certain conditions.
If there are non-missing values within 7 days before and 7 days after, use the average of the closest previous/next day to missing as the imputed value. If the closest non-missing value is outside this 7 day window (before or after), then keep as missing. If the first or last day of the data collected is missing, keep as missing.
A few caveats I can think of are:
-filling in these values must be done BY zipcode.
-sometimes there are several consecutive missing values
-this is a big dataset, so the more efficient, the better
The way I am currently (not successfully) going about it is:
get nearest before value (var1), nearest after value (var2), and average these --> but dont know how to account for the 7 day condition
Thank you!
data have;
input date zip temp;
datalines;
jan1 90001 50
jan2 90001 51
jan3 90001 53
jan4 90001 .
jan5 90001 49
jan6 90001 .
jan7 90001 .
jan8 90001 .
jan9 90001 50
jan10 90001 55
;
run;
data want;
input input date zip temp temp_new;
datalines;
jan1 90001 50 50
jan2 90001 51 51
jan3 90001 53 53
jan4 90001 . 51
jan5 90001 49 49
jan6 90001 . 49.5
jan7 90001 . 49.5
jan8 90001 . 49.5
jan9 90001 50 50
jan10 90001 55 55
;
run
OK. That would be more simple.
data have;
input date : anydtdte. zip temp;
format date date9.;
datalines;
2000jan1 90001 50
2000jan2 90001 51
2000jan3 90001 53
2000jan4 90001 .
2000jan5 90001 49
2000jan6 90001 .
2000jan7 90001 .
2000jan8 90001 .
2000jan9 90001 50
2000jan10 90001 55
;
run;
data want;
if _n_=1 then do;
if 0 then set have;
declare hash h(dataset:'have',hashexp:20);
h.definekey('zip','date');
h.definedata('temp');
h.definedone();
end;
set have;
if missing(temp) then do;
do i=1 to 7;
_date=date-i;
call missing(temp);
if h.find(key:zip,key:_date)=0 and not missing(temp) then do;lag=temp;leave;end;
end;
do i=1 to 7;
_date=date+i;
call missing(temp);
if h.find(key:zip,key:_date)=0 and not missing(temp) then do;next=temp;leave;end;
end;
temp=(lag+next)/2;
end;
format _date date9.;
drop lag next i _date;
run;
Can you provide sample data for two zip codes each having 8 - 10 observations to check for 7-day window? Show the resulting output you need from out of that data.
here is sample data of what i have and want.
data have;
input date zip temp;
datalines;
01Jan2005 90001 50
02Jan2005 90001 51
03Jan2005 90001 52
04Jan2005 90001 49
05Jan2005 90001 48
06Jan2005 90001 .
07Jan2005 90001 50
08Jan2005 90001 51
09Jan2005 90001 52
10Jan2005 90001 49
11Jan2005 90001 54
12Jan2005 90001 60
13Jan2005 90001 50
14Jan2005 90001 .
15Jan2005 90001 52
16Jan2005 90001 49
17Jan2005 90001 .
18Jan2005 90001 60
19Jan2005 90001 50
20Jan2005 90001 51
01Jan2005 90002 60
02Jan2005 90002 51
03Jan2005 90002 52
04Jan2005 90002 52
05Jan2005 90002 52
06Jan2005 90002 .
07Jan2005 90002 .
08Jan2005 90002 .
09Jan2005 90002 .
10Jan2005 90002 .
11Jan2005 90002 .
12Jan2005 90002 .
13Jan2005 90002 .
14Jan2005 90002 .
15Jan2005 90002 52
16Jan2005 90002 49
17Jan2005 90002 52
18Jan2005 90002 49
19Jan2005 90002 48
20Jan2005 90002 60
;
run;
data want;
input date zip temp temp_new;
datalines;
01Jan2005 90001 50
02Jan2005 90001 51
03Jan2005 90001 52
04Jan2005 90001 49
05Jan2005 90001 48
06Jan2005 90001 . 49
07Jan2005 90001 50
08Jan2005 90001 51
09Jan2005 90001 52
10Jan2005 90001 49
11Jan2005 90001 54
12Jan2005 90001 60
13Jan2005 90001 50
14Jan2005 90001 . 51
15Jan2005 90001 52
16Jan2005 90001 49
17Jan2005 90001 . 54.5
18Jan2005 90001 60
19Jan2005 90001 50
20Jan2005 90001 51
01Jan2005 90002 60
02Jan2005 90002 51
03Jan2005 90002 52
04Jan2005 90002 52
05Jan2005 90002 52
06Jan2005 90002 . .
07Jan2005 90002 . .
08Jan2005 90002 . 52
09Jan2005 90002 . 52
10Jan2005 90002 . 52
11Jan2005 90002 . 52
12Jan2005 90002 . 52
13Jan2005 90002 . .
14Jan2005 90002 . .
15Jan2005 90002 52
16Jan2005 90002 49
17Jan2005 90002 52
18Jan2005 90002 49
19Jan2005 90002 48
20Jan2005 90002 60
;
run;
You have several ways to do your job. But you indicate that you are not getting the expected output.
In your sample data set, you have not provided one condition to be checked. You want the TEMP with a missing value for the first and last date of the ZIP be retained as such. So I have modified your input data for checking the condition :
data have;
INFILE CARDS EXPANDTABS;
input date date9. zip temp;
cards;
01Jan2005 90001 .
02Jan2005 90001 51
03Jan2005 90001 52
04Jan2005 90001 49
05Jan2005 90001 48
06Jan2005 90001 .
07Jan2005 90001 50
08Jan2005 90001 51
09Jan2005 90001 52
10Jan2005 90001 49
11Jan2005 90001 54
12Jan2005 90001 60
13Jan2005 90001 50
14Jan2005 90001 .
15Jan2005 90001 52
16Jan2005 90001 49
17Jan2005 90001 .
18Jan2005 90001 60
19Jan2005 90001 50
20Jan2005 90001 .
01Jan2005 90002 60
02Jan2005 90002 51
03Jan2005 90002 52
04Jan2005 90002 52
05Jan2005 90002 52
06Jan2005 90002 .
07Jan2005 90002 .
08Jan2005 90002 .
09Jan2005 90002 .
10Jan2005 90002 .
11Jan2005 90002 .
12Jan2005 90002 .
13Jan2005 90002 .
14Jan2005 90002 .
15Jan2005 90002 52
16Jan2005 90002 49
17Jan2005 90002 52
18Jan2005 90002 49
19Jan2005 90002 48
20Jan2005 90002 .
;
run;
Next, we scan the DATA SET to get the SMALLEST and the BIGGEST Date to SIZE the Array to be used in the following Program. If you know these before hand, you can skip this step but you need to create Macro variables for them (bDate, eDate). The array is sized for the biggest group. So, small groups will also fit into it.
data _null_;
retain eDate 0 sDate 999999999;
set have end = eof;
if date > eDate then eDate = date;
if date < sDate then sDate = date;
if eof then do;
call symputx('sDate', sDate);
call symputx('eDate', eDate);
end;
run;
The program uses two DO-LOOPs. This do-loop is known as DoW-loop. In the first do-loop, TEMP for each DATE is saved in a _temporary_ array. Only one Zip at a time is processed. We save the first DATE and the last DATE to check your boundary conditions. In the second do-loop, the same group is scanned again and checked for your requirements. Here is the program:
options fullstimer;
data want;
array k[&sDate:&eDate] _temporary_;
do until(last.zip);
set have;
by zip;
k[date] = temp;
if first.zip then startD = date;
if last.zip then endD = date;
end;
do until(last.zip);
set have;
by zip;
if (date = startD & temp = .) | (date = endD & temp = .) then temp_new = .;
else if not missing(temp) then temp_new = temp;
else do;
tf = .; td = .;
do m = date+1 to date+7; * going forward;
if k[m] > 0 then do; tf = k[m]; leave; end;
end;
do m = date - 1 to date - 7 by -1; * going backward;
if k[m] > 0 then do; td = k[m]; leave; end;
end;
if missing(tf) or missing(td) then temp_new = .;
else temp_new = (tf + td) * 0.5;
end;
output;
end;
keep date zip temp temp_new;
run;
Here is the OUTPUT generated by this program:
Obs date zip temp temp_new
1 16437 90001 50 .
2 16438 90001 51 51.0
3 16439 90001 52 52.0
4 16440 90001 49 49.0
5 16441 90001 48 48.0
6 16442 90001 . 49.0
7 16443 90001 50 50.0
8 16444 90001 51 51.0
9 16445 90001 52 52.0
10 16446 90001 49 49.0
11 16447 90001 54 54.0
12 16448 90001 60 60.0
13 16449 90001 50 50.0
14 16450 90001 . 51.0
15 16451 90001 52 52.0
16 16452 90001 49 49.0
17 16453 90001 . 54.5
18 16454 90001 60 60.0
19 16455 90001 50 50.0
20 16456 90001 51 .
21 16437 90002 60 .
22 16438 90002 51 51.0
23 16439 90002 52 52.0
24 16440 90002 52 52.0
25 16441 90002 52 52.0
26 16442 90002 . .
27 16443 90002 . .
28 16444 90002 . 52.0
29 16445 90002 . 52.0
30 16446 90002 . 52.0
31 16447 90002 . 52.0
32 16448 90002 . 52.0
33 16449 90002 . .
34 16450 90002 . .
35 16451 90002 52 52.0
36 16452 90002 49 49.0
37 16453 90002 52 52.0
38 16454 90002 49 49.0
39 16455 90002 48 48.0
40 16456 90002 60 .
You can compare all the methods and see how FULLSTIMER gives Time and Memory. Enjoy.
Like this?
data HAVE;
input DATE date7. ZIP TEMP;
datalines;
01jan17 90001 50
02jan17 90001 51
03jan17 90001 53
04jan17 90001 .
05jan17 90001 49
06jan17 90001 .
07jan17 90001 .
08jan17 90001 .
09jan17 90001 50
10jan17 90001 55
10feb17 90001 .
11feb17 90001 55
10mar17 90001 .
20mar17 90001 55
run;
data WANT;
array ZIPS [%sysevalf('01jan2010'd):%sysevalf('01may2019'd)] _temporary_;
array TEMPS[%sysevalf('01jan2010'd):%sysevalf('01may2019'd)] _temporary_;
if _N_=1 then do until(LASTOBS);
set HAVE end=LASTOBS;
ZIPS [DATE]=ZIP ;
TEMPS[DATE]=TEMP;
end;
set HAVE;
by ZIP;
if TEMP=. then do;
do I=DATE-1 to DATE-7 by -1;
if ZIPS[I] ne ZIP then leave;
if TEMPS[I] then LAST_TEMP=TEMPS[I];
if LAST_TEMP then leave;
end;
do I=DATE+1 to DATE+7;
if ZIPS[I] ne ZIP then leave;
if TEMPS[I] then NEXT_TEMP=TEMPS[I];
if NEXT_TEMP then leave;
end;
if ^first.ZIP and ^last.ZIP then TEMP=mean(LAST_TEMP,NEXT_TEMP);
end;
run;
DATE | ZIP | TEMP |
---|---|---|
20820 | 90001 | 50.0 |
20821 | 90001 | 51.0 |
20822 | 90001 | 53.0 |
20823 | 90001 | 51.0 |
20824 | 90001 | 49.0 |
20825 | 90001 | 49.5 |
20826 | 90001 | 49.5 |
20827 | 90001 | 49.5 |
20828 | 90001 | 50.0 |
20829 | 90001 | 55.0 |
20860 | 90001 | 55.0 |
20861 | 90001 | 55.0 |
20888 | 90001 | . |
20898 | 90001 | 55.0 |
It should be fast as it just performs 2 sequential reads of the table.
hello,
thank you for this code! i tried the following and am getting an error due to array being out of range. i think it might be because of missing values either at the start or end date. below is the code and error.
445 data WANT; 446 array ZIPS [%sysevalf('01jan2005'd):%sysevalf('31dec2010'd)] _temporary_; 447 array TEMPS[%sysevalf('01jan2005'd):%sysevalf('31dec2010'd)] _temporary_; 448 if _N_=1 then do until(LASTOBS); 449 set test end=LASTOBS; 450 ZIPS [DATE]=zipnum ; 451 TEMPS[DATE]=meantemp; 452 end; 453 set test; 454 by zipnum; 455 if meantemp=. then do; 456 do I=DATE-1 to DATE-7 by -1; 457 if ZIPS[I] ne zipnum then leave; 458 if TEMPS[I] then LAST_TEMP=TEMPS[I]; 459 if LAST_TEMP then leave; 460 end; 461 do I=DATE+1 to DATE+7; 462 if ZIPS[I] ne zipnum then leave; 463 if TEMPS[I] then NEXT_TEMP=TEMPS[I]; 464 if NEXT_TEMP then leave; 465 end; 466 if ^first.zipnum and ^last.zipnum then meantemp=mean(LAST_TEMP,NEXT_TEMP); 467 end; 468 run; ERROR: Array subscript out of range at line 462 column 10. LASTOBS=1 date=31DEC2010 zipnum=90004 meantemp=. FIRST.zipnum=0 LAST.zipnum=1 I=18628 LAST_TEMP=. NEXT_TEMP=. _ERROR_=1 _N_=8764 NOTE: Missing values were generated as a result of performing an operation on missing values. Each place is given by: (Number of times) at (Line):(Column). 380 at 466:53 NOTE: The SAS System stopped processing this step because of errors. NOTE: There were 3615150 observations read from the data set WORK.TEST. NOTE: There were 8765 observations read from the data set WORK.TEST. WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 8763 observations and 6 variables. NOTE: DATA statement used (Total process time): real time 0.19 seconds cpu time 0.15 seconds
you look 7 days before jan2005 so your array must start before too.
I gave you a solution that works.
What do you change everything if you don't know array syntax?
If you must, start from the working solution, and change one thing at a time ("oh i'll try to start exactly on 01jan2005", or "oh i'll try curly brackets").
You are giving yourself (and us) more work than needed.
Using a SET statement with dataset HAVE specified as the argument twice, followed with a BY statement, provides a compact way to avoid the double "do until (last.zip)" approach:
data want (drop=_d);
set have (in=firstpass)
have (in=secondpass);
by zip;
array thist{%sysevalf{'01jan2016'd-7}:%sysevalf{'31dec2016'd+7} _temporary_;
if first.zip then call missing(of thist{*});
if firstpass then thist{date}=temp;
if secondpass;
if temp=. and lag(zip)^=zip and last.zip=0 then do _d=1 to 7 while(temp=.);
temp=mean(thist{date - _d},thist{date + _d});
end;
run;
Notes:
thank you! tried the following code, and get the following errors. Ideas? I don't understand the array well enough to figure out how to debug this. looking into it now.
proc sort data=have; by zipnum; run;
data want (drop=_d);
set have (in=firstpass)
have (in=secondpass);
by zipnum;
array thist{%sysevalf{'01jan2005'd-7}:%sysevalf{'31dec2010'd+7} _temporary_;
if first.zipnum then call missing(of thist{*});
if firstpass then thist{date}=meantemp;
if secondpass;
if meantemp=. and lag(zipnum)^=zip and last.zipum=0 then do _d=1 to 7 while(meantemp=.);
meantemp=mean(thist{date - _d},thist{date + _d});
end;
run;
181 data test2 (drop=_d); 182 set test (in=firstpass) 183 test (in=secondpass); 184 by zipnum; 185 186 array thist{%sysevalf{'01jan2005'd-7}:%sysevalf{'31dec2010'd+7} _temporary_; - 22 76 ERROR: %SYSEVALF must be followed by an expression enclosed in parentheses. ERROR: %SYSEVALF must be followed by an expression enclosed in parentheses. ERROR 22-322: Syntax error, expecting one of the following: an integer constant, *. ERROR 76-322: Syntax error, statement will be ignored. 187 if first.zipnum then call missing(of thist{*}); 188 if firstpass then thist{date}=meantemp; ERROR: Too many array subscripts specified for array thist. 189 190 if secondpass; 191 if meantemp=. and lag(zipnum)^=zip and last.zipum=0 then do _d=1 to 7 while(meantemp=.); 192 meantemp=mean(thist{date - _d},thist{date + _d}); ERROR: Too many array subscripts specified for array thist. ERROR: Too many array subscripts specified for array thist. 193 end; 194 run; NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column). 192:19 192:36 NOTE: The SAS System stopped processing this step because of errors. WARNING: The data set WORK.TEST2 may be incomplete. When this step was stopped there were 0 observations and 4 variables. WARNING: Data set WORK.TEST2 was not replaced because this step was stopped. NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.01 seconds
Your error message
186 array thist{%sysevalf{'01jan2005'd-7}:%sysevalf{'31dec2010'd+7} _temporary_; - 22 76 ERROR: %SYSEVALF must be followed by an expression enclosed in parentheses. ERROR: %SYSEVALF must be followed by an expression enclosed in parentheses. ERROR 22-322: Syntax error, expecting one of the following: an integer constant, *. ERROR 76-322: Syntax error, statement will be ignored.
You cannot use curly brackets in function calls, use normal parentheses.
%sysevalf('01jan2005'd-7)
NOTE: you don't NEED to use square [] or curly {} brackets anywhere in base SAS, you can just use normal () parentheses for everything.
array thist( %sysevalf('01jan2005'd-7):%sysevalf('31dec2010'd+7)) _temporary_;
I find it makes typing much easier. Plus if your fonts are too small it is really hard to tell the difference. Never got in the habit of using them anyway since they caused havoc in the old days when using ASCII terminals to connect to IBM mainframes.
second attempt. tried this. no errors, but new dataset looks same as original.
513 proc sort data=test; by zipnum; run; NOTE: Input data set is already sorted, no sorting done. NOTE: PROCEDURE SORT used (Total process time): real time 0.00 seconds cpu time 0.01 seconds 514 data want (drop=_d); 515 set test (in=firstpass) 516 test (in=secondpass); 517 by zipnum; 518 519 array thist(%sysevalf('01jan2005'd-7):%sysevalf('31dec2010'd+7)) _temporary_; 520 if first.zipnum then call missing(of thist{*}); 521 if firstpass then thist{date}=meantemp; 522 523 if secondpass; 524 if meantemp=. and lag(zipnum)^=zipnum and last.zipnum=0 then do _d=1 to 7 while(meantemp=.); 525 meantemp=mean(thist{date - _d},thist{date + _d}); 526 end; 527 run; NOTE: Missing values were generated as a result of performing an operation on missing values. Each place is given by: (Number of times) at (Line):(Column). 4172 at 525:14 NOTE: There were 3615150 observations read from the data set WORK.TEST. NOTE: There were 3615150 observations read from the data set WORK.TEST. NOTE: The data set WORK.WANT has 3615150 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.74 seconds cpu time 0.71 seconds
How about this one .
data have;
input date $ zip temp;
datalines;
jan1 90001 50
jan2 90001 51
jan3 90001 53
jan4 90001 .
jan5 90001 49
jan6 90001 .
jan7 90001 .
jan8 90001 .
jan9 90001 50
jan10 90001 55
;
run;
data key;
set have;
by temp notsorted;
k+first.temp;
run;
data want;
if _n_=1 then do;
if 0 then set key;
declare hash h(dataset:'key',hashexp:20);
h.definekey('k');
h.definedata('temp');
h.definedone();
end;
set key;
if missing(temp) then do;
call missing(temp);
rc=h.find(key:k-1);
lag=temp;
call missing(temp);
rc=h.find(key:k+1);
next=temp;
temp=mean(lag,next);
end;
drop rc lag next k;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.