- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello all,
I have the following dataset:
station | year | month | day | prep |
1 | 2000 | 12 | 17 | 1 |
1 | 2000 | 12 | 18 | 1 |
1 | 2000 | 12 | 19 | 1 |
1 | 2000 | 12 | 20 | 1 |
1 | 2000 | 12 | 21 | . |
1 | 2000 | 12 | 22 | . |
1 | 2000 | 12 | 23 | . |
1 | 2000 | 12 | 24 | 1 |
1 | 2000 | 12 | 25 | 1 |
1 | 2000 | 12 | 26 | 1 |
1 | 2000 | 12 | 27 | 1 |
1 | 2000 | 12 | 28 | . |
1 | 2000 | 12 | 29 | 1 |
1 | 2000 | 12 | 30 | . |
1 | 2000 | 12 | 31 | . |
1 | 2001 | 1 | 1 | . |
1 | 2001 | 1 | 2 | . |
1 | 2001 | 1 | 3 | . |
1 | 2001 | 1 | 4 | 1 |
1 | 2001 | 1 | 5 | 1 |
1 | 2001 | 1 | 6 | 1 |
1 | 2001 | 1 | 7 | . |
1 | 2001 | 1 | 8 | 1 |
1 | 2001 | 1 | 9 | . |
1 | 2001 | 1 | 10 | . |
1 | 2001 | 1 | 11 | 1 |
1 | 2001 | 1 | 12 | 1 |
2 | 2000 | 12 | 17 | 1 |
2 | 2000 | 12 | 18 | 1 |
2 | 2000 | 12 | 19 | 1 |
2 | 2000 | 12 | 20 | 1 |
2 | 2000 | 12 | 21 | . |
2 | 2000 | 12 | 22 | . |
2 | 2000 | 12 | 23 | . |
2 | 2000 | 12 | 24 | 1 |
2 | 2000 | 12 | 25 | 1 |
2 | 2000 | 12 | 26 | 1 |
2 | 2000 | 12 | 27 | 1 |
2 | 2000 | 12 | 28 | . |
2 | 2000 | 12 | 29 | 1 |
2 | 2000 | 12 | 30 | . |
2 | 2000 | 12 | 31 | . |
2 | 2003 | 1 | 1 | . |
2 | 2003 | 1 | 2 | . |
2 | 2003 | 1 | 3 | . |
2 | 2003 | 1 | 4 | 1 |
2 | 2003 | 1 | 5 | 1 |
2 | 2003 | 1 | 6 | 1 |
2 | 2003 | 1 | 7 | . |
2 | 2003 | 1 | 8 | 1 |
2 | 2003 | 1 | 9 | . |
2 | 2003 | 1 | 10 | . |
2 | 2003 | 1 | 11 | 1 |
2 | 2003 | 1 | 12 | 1 |
I use this code to calculate the continuous missing value by station,
data myfile_1;
count=0;
do until(last.prep);
set myfile;
by station prep notsorted;
if missing(prep) then count+1;
end;
do until(last.prep);
set myfile;
by station prep notsorted;
output;
end;
run;
However, here is my question. What should I do if I want to calculate the missing value only within consecutive years. For station 1, the time is continuous, so I could use this code to calculate. However, for station 2, year 2003 is right after 2001, if I use this code, the results will be calculated as 5 continuous missing value together among 2001 and 2003, not 2 and 3 separately.
I will really appreciate if someone has any idea about this.
Thank you!
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I don't see a need to add observations, but I do see a need to process the data twice. The first time, detect the gaps:
data halfway_there;
set have;
by station year month day;
gap_size = mdy(year, month, day) - mdy(lag(year), lag(month), lag(day));
if first.station or (gap_size ne 1) then group + 1;
run;
Now the variable GROUP is unique for each group of observations that should be processed together. You could proceed in very similar fashion to what you have already done:
data want;
count=0;
do until (last.prep);
set halfway_there;
by group prep notsorted;
if missing(prep) then count + 1;
end;
do until (last.prep);
set halfway_there;
by group prep notsorted;
output;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You'd need to modify the data set so that each station has at least one record for each year in the study. Since you are coutning missing values of prep, this added record would have to be non-missing for prep.
Then the next step is simple. Use PROC SUMMARY to count the number of missings by station and year, then in a data step, you could easily add the number of missings for the current year to the number of missings from the previous year (using the LAG function).
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you have a large dataset, these steps really won't add much time to the running of the program.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I don't see a need to add observations, but I do see a need to process the data twice. The first time, detect the gaps:
data halfway_there;
set have;
by station year month day;
gap_size = mdy(year, month, day) - mdy(lag(year), lag(month), lag(day));
if first.station or (gap_size ne 1) then group + 1;
run;
Now the variable GROUP is unique for each group of observations that should be processed together. You could proceed in very similar fashion to what you have already done:
data want;
count=0;
do until (last.prep);
set halfway_there;
by group prep notsorted;
if missing(prep) then count + 1;
end;
do until (last.prep);
set halfway_there;
by group prep notsorted;
output;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I don't see a need to process the data twice (especially since it was stated this is a large data set).
However, we may be talking about the same thing, detecting gaps or adding records, eventually gets us to the same point.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The code works exactly what I want, but I think there has an error that mdy() function should be mdy(month, day, year) not mdy(year, month, day).
Thank you !