Hello all,
I have the following dataset:
station | year | month | day | prep |
1 | 2000 | 12 | 17 | 1 |
1 | 2000 | 12 | 18 | 1 |
1 | 2000 | 12 | 19 | 1 |
1 | 2000 | 12 | 20 | 1 |
1 | 2000 | 12 | 21 | . |
1 | 2000 | 12 | 22 | . |
1 | 2000 | 12 | 23 | . |
1 | 2000 | 12 | 24 | 1 |
1 | 2000 | 12 | 25 | 1 |
1 | 2000 | 12 | 26 | 1 |
1 | 2000 | 12 | 27 | 1 |
1 | 2000 | 12 | 28 | . |
1 | 2000 | 12 | 29 | 1 |
1 | 2000 | 12 | 30 | . |
1 | 2000 | 12 | 31 | . |
1 | 2001 | 1 | 1 | . |
1 | 2001 | 1 | 2 | . |
1 | 2001 | 1 | 3 | . |
1 | 2001 | 1 | 4 | 1 |
1 | 2001 | 1 | 5 | 1 |
1 | 2001 | 1 | 6 | 1 |
1 | 2001 | 1 | 7 | . |
1 | 2001 | 1 | 8 | 1 |
1 | 2001 | 1 | 9 | . |
1 | 2001 | 1 | 10 | . |
1 | 2001 | 1 | 11 | 1 |
1 | 2001 | 1 | 12 | 1 |
2 | 2000 | 12 | 17 | 1 |
2 | 2000 | 12 | 18 | 1 |
2 | 2000 | 12 | 19 | 1 |
2 | 2000 | 12 | 20 | 1 |
2 | 2000 | 12 | 21 | . |
2 | 2000 | 12 | 22 | . |
2 | 2000 | 12 | 23 | . |
2 | 2000 | 12 | 24 | 1 |
2 | 2000 | 12 | 25 | 1 |
2 | 2000 | 12 | 26 | 1 |
2 | 2000 | 12 | 27 | 1 |
2 | 2000 | 12 | 28 | . |
2 | 2000 | 12 | 29 | 1 |
2 | 2000 | 12 | 30 | . |
2 | 2000 | 12 | 31 | . |
2 | 2003 | 1 | 1 | . |
2 | 2003 | 1 | 2 | . |
2 | 2003 | 1 | 3 | . |
2 | 2003 | 1 | 4 | 1 |
2 | 2003 | 1 | 5 | 1 |
2 | 2003 | 1 | 6 | 1 |
2 | 2003 | 1 | 7 | . |
2 | 2003 | 1 | 8 | 1 |
2 | 2003 | 1 | 9 | . |
2 | 2003 | 1 | 10 | . |
2 | 2003 | 1 | 11 | 1 |
2 | 2003 | 1 | 12 | 1 |
I use this code to calculate the continuous missing value by station,
data myfile_1;
count=0;
do until(last.prep);
set myfile;
by station prep notsorted;
if missing(prep) then count+1;
end;
do until(last.prep);
set myfile;
by station prep notsorted;
output;
end;
run;
However, here is my question. What should I do if I want to calculate the missing value only within consecutive years. For station 1, the time is continuous, so I could use this code to calculate. However, for station 2, year 2003 is right after 2001, if I use this code, the results will be calculated as 5 continuous missing value together among 2001 and 2003, not 2 and 3 separately.
I will really appreciate if someone has any idea about this.
Thank you!
I don't see a need to add observations, but I do see a need to process the data twice. The first time, detect the gaps:
data halfway_there;
set have;
by station year month day;
gap_size = mdy(year, month, day) - mdy(lag(year), lag(month), lag(day));
if first.station or (gap_size ne 1) then group + 1;
run;
Now the variable GROUP is unique for each group of observations that should be processed together. You could proceed in very similar fashion to what you have already done:
data want;
count=0;
do until (last.prep);
set halfway_there;
by group prep notsorted;
if missing(prep) then count + 1;
end;
do until (last.prep);
set halfway_there;
by group prep notsorted;
output;
end;
run;
You'd need to modify the data set so that each station has at least one record for each year in the study. Since you are coutning missing values of prep, this added record would have to be non-missing for prep.
Then the next step is simple. Use PROC SUMMARY to count the number of missings by station and year, then in a data step, you could easily add the number of missings for the current year to the number of missings from the previous year (using the LAG function).
If you have a large dataset, these steps really won't add much time to the running of the program.
I don't see a need to add observations, but I do see a need to process the data twice. The first time, detect the gaps:
data halfway_there;
set have;
by station year month day;
gap_size = mdy(year, month, day) - mdy(lag(year), lag(month), lag(day));
if first.station or (gap_size ne 1) then group + 1;
run;
Now the variable GROUP is unique for each group of observations that should be processed together. You could proceed in very similar fashion to what you have already done:
data want;
count=0;
do until (last.prep);
set halfway_there;
by group prep notsorted;
if missing(prep) then count + 1;
end;
do until (last.prep);
set halfway_there;
by group prep notsorted;
output;
end;
run;
I don't see a need to process the data twice (especially since it was stated this is a large data set).
However, we may be talking about the same thing, detecting gaps or adding records, eventually gets us to the same point.
The code works exactly what I want, but I think there has an error that mdy() function should be mdy(month, day, year) not mdy(year, month, day).
Thank you !
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.