Dataset "A" is composed of bird nesting information with a single nesting attempt coded per line with multiple variables. The important variables in this situation:
ID (unique sequential observation identifier);
AREA (study area - 1, 2, or 3);
YEAR (study year - 2005, 2006, or 2007);
START (start day of the nesting attempt - all dates are coded as 1 for January 1st continuing sequentially until the final day of that year);
FINISH (last day of nesting attempt).
Dataset "B" is composed of weather information with a single day per line of code with multiple measurement variables on each day. The important variables in this situation:
AREA (similar to above);
YEAR (study year - 2005, 2006, or 2007);
DAY (similar format to above);
H_TEMP (high temp for that day);
L_TEMP (low temp for that day);
M_TEMP (mean temp for that day);
PRECIP (precipitation in inches).
What I need to do is calculate a min, mean, and max H_TEMP, L_TEMP, and M_TEMP and a sum of all PRECIP over the nesting period for each nesting attempt in dataset "A" by AREA. Since the date codes used are not unique per year I'm assuming the code with have to be written on a by year basis, and ideally the output would be new variables added to Dataset "A" for each nesting attempt (but the created dataset "C" could always be merged with dataset "A" by ID if necessary, I'm assuming).
I've searched online for an example or code to help with this but have been unsuccessful in finding anything. Any code, help or keywords would be greatly appreciated.
As far as output goes I'm envisioning a dataset "C"with the following variables:
ID - individual nesting attempt identifier from dataset "A"
YEAR - coincides with both dataset "A" and "B"
AREA - coincides with both dataset "A" and "B"
min_H_TEMP - calculated (see below for how)
mean_H_TEMP - calculated (see below for how)
max_H_TEMP - calculated (see below for how)
min_L_TEMP - calculated (see below for how)
mean_L_TEMP - calculated (see below for how)
max_L_TEMP - calculated (see below for how)
min_M_TEMP - calculated (see below for how)
mean_M_TEMP - calculated (see below for how)
max_M_TEMP - calculated (see below for how)
sum_PRECIP - calculated (see below for how)
Each sample in dataset "A" has a unique identifier (ID) and a start (START) and an end (FINISH) date associated with a specific year of the study (YEAR) and a study area (AREA).
Dataset "B" has corresponding YEAR, AREA, and DAY (related to START and FINISH) information for weather variables (H_TEMP, L_TEMP, M_TEMP, and PRECIP).
I need to figure out how to write SAS code to use dataset "A" to find START and FINISH dates for an individual nesting attempt identified by ID, YEAR, and AREA and then take those values and calculate based on dataset "B" the min, max and mean H_TEMP, L_TEMP, and M_TEMP as well as the sum of all precipitation (PRECIP) for the specified interval (START->FINISH) at the same AREA in the same YEAR as the nesting attempt (ID) from dataset "A".
Here's an example from above starting in dataset "A" with this information:
For this sample I would need the mean, min, and max H_TEMP, L_TEMP, and M_TEMP and sum of PRECIP calculated for the interval starting on 171 and ending on 204 from dataset "B" for 2006 at area 3.
Just curious...what's supposed to happen with the missing values from A, as, for example on ID 542 or 548 or 553?? Should those observations be ignored????
Also, you say you want the min, max AND mean for h_temp, l_temp and m_temp...so you'd get:
min_h_temp, max_h_temp, and mean_h_temp???? (High)
min_l_temp, max_l_temp and min_l_temp??? (Low)
min_m_temp, max_m_temp and mean_m_temp??? (Mean)
I don't understand why you'd calculate the mean of the mean...do you really want to add up all the mean temps and divide by the count of days? I thought taking the mean of the mean was not a great idea? (I'm not a heavy stat person -- I just know that statisticians that I've worked for (in the past) would never have let me calculate the mean of the mean.)
As I said, just curious -- I think what you want to do is do-able -- probably with some Proc TRANSPOSE and Data step programming, but some of what you need/want seems confusing to me.
As far as missing values from "A" are concerned for now yes they can be ignored. If both values are missing a nest was not initiated for that observation, but if only FINISH is missing then it was an unsucessful nest and I need to go back to the log books and enter the date it failed into the code.
In general I would definitely agree with you that calculating the mean of the mean is a bad practice and if I had hourly weather data I could calculate the mean temp over the course of the interval but all I have is the mean temp for each day, which leaves me calculating the mean of a M_TEMP over the course of the nesting interval (granted not ideal but will hopefully provide an idea of weather conditions during that time when combined with the other measures).
I'll look into the TRANSPOSE function but I think I'm going to need help getting through the DATA STEP portion.
Your sample data don't match too well (i.e. different years and area codes in the two datasets). I changed them a bit hoping that my assumptions were right.
Have a look at the following code. It might give you some hints even if it might not be exactly what you're looking for.
The code prepares mainly one big denormalised dataset (kind of a "datamart"). It shouldn't be too difficult to get what you want by using proc means on this dataset - and there you can also decide how to handle missings.
I converted the 'year' and 'day' variables to a variable 'date' containing a SAS date (count of days since 1 January 1960). This way your dates are "unique" for the same day in different years.
Since the year and day together make up a Julian date, I'd be tempted to do something like this instead of using arithmetic with a starting value of 0101:
schar = catt(year,put(start,z3.));
You still have to check for missing, etc, etc. I show it as 2 separate statements just for ease of understanding how the date is being turned first into a character value with year and the day value concatenated together and then the INPUT function uses the JULIAN informat to translate the julian form of the date to the SAS internal number.
A big thank you goes out to both of you for your help. I'm working on the code to compute means using PROC MEANS (output by ID) which is working I just need to update my files and code it the rest of the way through.
If I hit another wall I'll be sure to post up again. Thanks for the help,