BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
hua
Obsidian | Level 7 hua
Obsidian | Level 7

Hello all,

I have the following dataset:

station year month day prep
1 2000 12 17 1
1 2000 12 18 1
1 2000 12 19 1
1 2000 12 20 1
1 2000 12 21 .
1 2000 12 22 .
1 2000 12 23 .
1 2000 12 24 1
1 2000 12 25 1
1 2000 12 26 1
1 2000 12 27 1
1 2000 12 28 .
1 2000 12 29 1
1 2000 12 30 .
1 2000 12 31 .
1 2001 1 1 .
1 2001 1 2 .
1 2001 1 3 .
1 2001 1 4 1
1 2001 1 5 1
1 2001 1 6 1
1 2001 1 7 .
1 2001 1 8 1
1 2001 1 9 .
1 2001 1 10 .
1 2001 1 11 1
1 2001 1 12 1
2 2000 12 17 1
2 2000 12 18 1
2 2000 12 19 1
2 2000 12 20 1
2 2000 12 21 .
2 2000 12 22 .
2 2000 12 23 .
2 2000 12 24 1
2 2000 12 25 1
2 2000 12 26 1
2 2000 12 27 1
2 2000 12 28 .
2 2000 12 29 1
2 2000 12 30 .
2 2000 12 31 .
2 2003 1 1 .
2 2003 1 2 .
2 2003 1 3 .
2 2003 1 4 1
2 2003 1 5 1
2 2003 1 6 1
2 2003 1 7 .
2 2003 1 8 1
2 2003 1 9 .
2 2003 1 10 .
2 2003 1 11 1
2 2003 1 12 1

 

I use this code to calculate the continuous missing value by station,

data myfile_1;
count=0;
do until(last.prep);
set myfile;
by station prep notsorted;
if missing(prep) then count+1;
end;
do until(last.prep);
set myfile;
by station prep notsorted;
output;
end;
run;

 

However, here is my question. What should I do if I want to calculate the missing value only within consecutive years. For station 1, the time is continuous, so I could use this code to calculate. However, for station 2, year 2003 is right after 2001, if I use this code, the results will be calculated as 5 continuous missing value together among 2001 and 2003, not 2 and 3 separately.

 

I will really appreciate if someone has any idea about this.

Thank you! 

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

I don't see a need to add observations, but I do see a need to process the data twice.  The first time, detect the gaps:

 

data halfway_there;

set have;

by station year month day;

gap_size = mdy(year, month, day) - mdy(lag(year), lag(month), lag(day));

if first.station or (gap_size ne 1) then group + 1;

run;

 

Now the variable GROUP is unique for each group of observations that should be processed together.  You could proceed in very similar fashion to what you have already done:

 

data want;

count=0;

do until (last.prep);

   set halfway_there;

   by group prep notsorted;

   if missing(prep) then count + 1;

end;

do until (last.prep);

   set halfway_there;

   by group prep notsorted;

   output;

end;

run;

 

View solution in original post

7 REPLIES 7
PaigeMiller
Diamond | Level 26

You'd need to modify the data set so that each station has at least one record for each year in the study. Since you are coutning missing values of prep, this added record would have to be non-missing for prep.

 

Then the next step is simple. Use PROC SUMMARY to count the number of missings by station and year, then in a data step, you could easily add the number of missings for the current year to the number of missings from the previous year (using the LAG function).

--
Paige Miller
hua
Obsidian | Level 7 hua
Obsidian | Level 7
Thank you! Yes, I have been thinking about this way, but it will add a lot of computation since I have a huge dataset. But if there has no other method, this is a way.
PaigeMiller
Diamond | Level 26

If you have a large dataset, these steps really won't add much time to the running of the program.

--
Paige Miller
Astounding
PROC Star

I don't see a need to add observations, but I do see a need to process the data twice.  The first time, detect the gaps:

 

data halfway_there;

set have;

by station year month day;

gap_size = mdy(year, month, day) - mdy(lag(year), lag(month), lag(day));

if first.station or (gap_size ne 1) then group + 1;

run;

 

Now the variable GROUP is unique for each group of observations that should be processed together.  You could proceed in very similar fashion to what you have already done:

 

data want;

count=0;

do until (last.prep);

   set halfway_there;

   by group prep notsorted;

   if missing(prep) then count + 1;

end;

do until (last.prep);

   set halfway_there;

   by group prep notsorted;

   output;

end;

run;

 
PaigeMiller
Diamond | Level 26

I don't see a need to process the data twice (especially since it was stated this is a large data set).

 

However, we may be talking about the same thing, detecting gaps or adding records, eventually gets us to the same point.

--
Paige Miller
hua
Obsidian | Level 7 hua
Obsidian | Level 7
Thank you for help! You're right, detecting gaps is the first thing to do when facing this problem.
hua
Obsidian | Level 7 hua
Obsidian | Level 7

The code works exactly what I want, but I think there has an error that mdy() function should be mdy(month, day, year) not mdy(year, month, day).

Thank you !

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 3155 views
  • 1 like
  • 3 in conversation