Solved: Re: count continuous missing value within consecutive years.

hua · Posted 10-05-2017 04:15 PM

Hello all,

I have the following dataset:

station	year	month	day	prep
1	2000	12	17	1
1	2000	12	18	1
1	2000	12	19	1
1	2000	12	20	1
1	2000	12	21	.
1	2000	12	22	.
1	2000	12	23	.
1	2000	12	24	1
1	2000	12	25	1
1	2000	12	26	1
1	2000	12	27	1
1	2000	12	28	.
1	2000	12	29	1
1	2000	12	30	.
1	2000	12	31	.
1	2001	1	1	.
1	2001	1	2	.
1	2001	1	3	.
1	2001	1	4	1
1	2001	1	5	1
1	2001	1	6	1
1	2001	1	7	.
1	2001	1	8	1
1	2001	1	9	.
1	2001	1	10	.
1	2001	1	11	1
1	2001	1	12	1
2	2000	12	17	1
2	2000	12	18	1
2	2000	12	19	1
2	2000	12	20	1
2	2000	12	21	.
2	2000	12	22	.
2	2000	12	23	.
2	2000	12	24	1
2	2000	12	25	1
2	2000	12	26	1
2	2000	12	27	1
2	2000	12	28	.
2	2000	12	29	1
2	2000	12	30	.
2	2000	12	31	.
2	2003	1	1	.
2	2003	1	2	.
2	2003	1	3	.
2	2003	1	4	1
2	2003	1	5	1
2	2003	1	6	1
2	2003	1	7	.
2	2003	1	8	1
2	2003	1	9	.
2	2003	1	10	.
2	2003	1	11	1
2	2003	1	12	1

I use this code to calculate the continuous missing value by station,

data myfile_1;
count=0;
do until(last.prep);
set myfile;
by station prep notsorted;
if missing(prep) then count+1;
end;
do until(last.prep);
set myfile;
by station prep notsorted;
output;
end;
run;

However, here is my question. What should I do if I want to calculate the missing value only within consecutive years. For station 1, the time is continuous, so I could use this code to calculate. However, for station 2, year 2003 is right after 2001, if I use this code, the results will be calculated as 5 continuous missing value together among 2001 and 2003, not 2 and 3 separately.

I will really appreciate if someone has any idea about this.

Thank you!

Astounding · Posted 10-06-2017 08:27 AM

I don't see a need to add observations, but I do see a need to process the data twice. The first time, detect the gaps:

data halfway_there;

set have;

by station year month day;

gap_size = mdy(year, month, day) - mdy(lag(year), lag(month), lag(day));

if first.station or (gap_size ne 1) then group + 1;

run;

Now the variable GROUP is unique for each group of observations that should be processed together. You could proceed in very similar fashion to what you have already done:

data want;

count=0;

do until (last.prep);

set halfway_there;

by group prep notsorted;

if missing(prep) then count + 1;

end;

do until (last.prep);

set halfway_there;

by group prep notsorted;

output;

end;

run;

View solution in original post

PaigeMiller · Posted 10-05-2017 04:23 PM

You'd need to modify the data set so that each station has at least one record for each year in the study. Since you are coutning missing values of prep, this added record would have to be non-missing for prep.

Then the next step is simple. Use PROC SUMMARY to count the number of missings by station and year, then in a data step, you could easily add the number of missings for the current year to the number of missings from the previous year (using the LAG function).

--
Paige Miller

hua · Posted 10-05-2017 04:28 PM

Thank you! Yes, I have been thinking about this way, but it will add a lot of computation since I have a huge dataset. But if there has no other method, this is a way.

PaigeMiller · Posted 10-05-2017 05:55 PM

If you have a large dataset, these steps really won't add much time to the running of the program.

--
Paige Miller

Astounding · Posted 10-06-2017 08:27 AM

I don't see a need to add observations, but I do see a need to process the data twice. The first time, detect the gaps:

data halfway_there;

set have;

by station year month day;

gap_size = mdy(year, month, day) - mdy(lag(year), lag(month), lag(day));

if first.station or (gap_size ne 1) then group + 1;

run;

Now the variable GROUP is unique for each group of observations that should be processed together. You could proceed in very similar fashion to what you have already done:

data want;

count=0;

do until (last.prep);

set halfway_there;

by group prep notsorted;

if missing(prep) then count + 1;

end;

do until (last.prep);

set halfway_there;

by group prep notsorted;

output;

end;

run;

PaigeMiller · Posted 10-06-2017 08:39 AM

I don't see a need to process the data twice (especially since it was stated this is a large data set).

However, we may be talking about the same thing, detecting gaps or adding records, eventually gets us to the same point.

--
Paige Miller

hua · Posted 10-06-2017 04:35 PM

Thank you for help! You're right, detecting gaps is the first thing to do when facing this problem.

hua · Posted 10-06-2017 04:33 PM

The code works exactly what I want, but I think there has an error that mdy() function should be mdy(month, day, year) not mdy(year, month, day).

Thank you !

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away