Help using Base SAS procedures

count continuous missing value within consecutive years.

Accepted Solution Solved
Reply
Highlighted
Contributor hua
Contributor
Posts: 43
Accepted Solution

count continuous missing value within consecutive years.

Hello all,

I have the following dataset:

station year month day prep
1 2000 12 17 1
1 2000 12 18 1
1 2000 12 19 1
1 2000 12 20 1
1 2000 12 21 .
1 2000 12 22 .
1 2000 12 23 .
1 2000 12 24 1
1 2000 12 25 1
1 2000 12 26 1
1 2000 12 27 1
1 2000 12 28 .
1 2000 12 29 1
1 2000 12 30 .
1 2000 12 31 .
1 2001 1 1 .
1 2001 1 2 .
1 2001 1 3 .
1 2001 1 4 1
1 2001 1 5 1
1 2001 1 6 1
1 2001 1 7 .
1 2001 1 8 1
1 2001 1 9 .
1 2001 1 10 .
1 2001 1 11 1
1 2001 1 12 1
2 2000 12 17 1
2 2000 12 18 1
2 2000 12 19 1
2 2000 12 20 1
2 2000 12 21 .
2 2000 12 22 .
2 2000 12 23 .
2 2000 12 24 1
2 2000 12 25 1
2 2000 12 26 1
2 2000 12 27 1
2 2000 12 28 .
2 2000 12 29 1
2 2000 12 30 .
2 2000 12 31 .
2 2003 1 1 .
2 2003 1 2 .
2 2003 1 3 .
2 2003 1 4 1
2 2003 1 5 1
2 2003 1 6 1
2 2003 1 7 .
2 2003 1 8 1
2 2003 1 9 .
2 2003 1 10 .
2 2003 1 11 1
2 2003 1 12 1

 

I use this code to calculate the continuous missing value by station,

data myfile_1;
count=0;
do until(last.prep);
set myfile;
by station prep notsorted;
if missing(prep) then count+1;
end;
do until(last.prep);
set myfile;
by station prep notsorted;
output;
end;
run;

 

However, here is my question. What should I do if I want to calculate the missing value only within consecutive years. For station 1, the time is continuous, so I could use this code to calculate. However, for station 2, year 2003 is right after 2001, if I use this code, the results will be calculated as 5 continuous missing value together among 2001 and 2003, not 2 and 3 separately.

 

I will really appreciate if someone has any idea about this.

Thank you! 


Accepted Solutions
Solution
2 weeks ago
Super User
Posts: 5,369

Re: count continuous missing value within consecutive years.

I don't see a need to add observations, but I do see a need to process the data twice.  The first time, detect the gaps:

 

data halfway_there;

set have;

by station year month day;

gap_size = mdy(year, month, day) - mdy(lag(year), lag(month), lag(day));

if first.station or (gap_size ne 1) then group + 1;

run;

 

Now the variable GROUP is unique for each group of observations that should be processed together.  You could proceed in very similar fashion to what you have already done:

 

data want;

count=0;

do until (last.prep);

   set halfway_there;

   by group prep notsorted;

   if missing(prep) then count + 1;

end;

do until (last.prep);

   set halfway_there;

   by group prep notsorted;

   output;

end;

run;

 

View solution in original post


All Replies
Trusted Advisor
Posts: 1,795

Re: count continuous missing value within consecutive years.

You'd need to modify the data set so that each station has at least one record for each year in the study. Since you are coutning missing values of prep, this added record would have to be non-missing for prep.

 

Then the next step is simple. Use PROC SUMMARY to count the number of missings by station and year, then in a data step, you could easily add the number of missings for the current year to the number of missings from the previous year (using the LAG function).

Contributor hua
Contributor
Posts: 43

Re: count continuous missing value within consecutive years.

Thank you! Yes, I have been thinking about this way, but it will add a lot of computation since I have a huge dataset. But if there has no other method, this is a way.
Trusted Advisor
Posts: 1,795

Re: count continuous missing value within consecutive years.

If you have a large dataset, these steps really won't add much time to the running of the program.

Solution
2 weeks ago
Super User
Posts: 5,369

Re: count continuous missing value within consecutive years.

I don't see a need to add observations, but I do see a need to process the data twice.  The first time, detect the gaps:

 

data halfway_there;

set have;

by station year month day;

gap_size = mdy(year, month, day) - mdy(lag(year), lag(month), lag(day));

if first.station or (gap_size ne 1) then group + 1;

run;

 

Now the variable GROUP is unique for each group of observations that should be processed together.  You could proceed in very similar fashion to what you have already done:

 

data want;

count=0;

do until (last.prep);

   set halfway_there;

   by group prep notsorted;

   if missing(prep) then count + 1;

end;

do until (last.prep);

   set halfway_there;

   by group prep notsorted;

   output;

end;

run;

 
Trusted Advisor
Posts: 1,795

Re: count continuous missing value within consecutive years.

I don't see a need to process the data twice (especially since it was stated this is a large data set).

 

However, we may be talking about the same thing, detecting gaps or adding records, eventually gets us to the same point.

Contributor hua
Contributor
Posts: 43

Re: count continuous missing value within consecutive years.

Thank you for help! You're right, detecting gaps is the first thing to do when facing this problem.
Contributor hua
Contributor
Posts: 43

Re: count continuous missing value within consecutive years.

The code works exactly what I want, but I think there has an error that mdy() function should be mdy(month, day, year) not mdy(year, month, day).

Thank you !

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 177 views
  • 1 like
  • 3 in conversation