Solved: Average sample data if it lays between 2 dates

ZacLopresti · Posted 02-24-2021 09:01 PM

Hi all,

First question here.

I have two datasets, the first containing a sample result column and a datetime column (when the sample was taken). The other, Containing a bed number, start date and end date.

What I want to achieve is an averaged sample value for each bed.

i.e. average the sample results if they lay between the start and end dates of each bed.

Thanks

PGStats · Posted 03-01-2021 10:50 PM

For equal weighting, it is quite simple, as long as bed periods don't overlap:

proc sql;
create table want as
select 
    bed,
    start, end,
    mean(PercentGrind) as averageGrind
from 
    BED_DATETIME as a left join
    BATTERYFEEDSAMPLE as b on b.datetime between start and end
group by bed, start, end;
quit;

PG

View solution in original post

ballardw · Posted 02-24-2021 09:52 PM

Average sample of what?

What type of variables are your start, end and datetime? Numeric with datetime format applied or character? One cannot tell from a picture.

Please do not call values with a time component a "date" when discussing SAS variables. Dates and times use different units of measurement and when you say "date" that means days. And they don't compare well with time plus raising a question of are the "start" and "end" supposed to be considering the "date" portion or the actual datetime value the picture implies.

And since "bed" is involved per "What I want to achieve is an averaged sample value for each bed." then which "bed" is the "sample" from? Or are we supposed to guess which bed was measured?

Maybe something along these lines but completely untested as I can't write code to read pictures and you didn't provide names of data sets.

ZacLopresti · Posted 03-01-2021 09:27 PM

Thanks for your reply.

Maybe a better way of explaining is as follows.

I have a bed (which is a mix of minerals), during the time that a specific bed is being used, multiple samples of the size fraction (PercentGrind) of the bed are taken. I want to average all the sample results taken within a specific bed.

I want the average of all the PercentGrind values if the corresponding DateTimes fall between the Startand End DateTimes of a specific Bed.

To answer your questions:

- All datetime variables are numeric.

-Each sample has the DateTime at which it was taken recorded with it. If the DateTime is between the Start and End Datetimes it has been taken from that bed.

- I will try and upload the data not as oictures but had troubles in the past.

Thanks,

Zac

ZacLopresti · Posted 03-01-2021 09:43 PM

data WORK.BATTERYFEEDSAMPLE;

infile datalines dsd truncover;

input DateTime:DATETIME20. PercentGrind:6.1;

format DateTime DATETIME20. PercentGrind 6.1;

label DateTime="SamplingStartTime" PercentGrind="Grind (%)";

datalines;

25JUN2016:13:00:00 89.9

26JUN2016:01:00:00 89.8

26JUN2016:13:00:00 87.0

27JUN2016:13:00:00 90.3

28JUN2016:13:00:00 90.6

29JUN2016:01:00:00 90.9

29JUN2016:01:01:00 .

29JUN2016:13:00:00 89.8

30JUN2016:01:00:00 90.2

30JUN2016:13:00:00 89.6

01JUL2016:01:00:00 90.6

01JUL2016:13:00:00 86.1

02JUL2016:01:00:00 86.9

02JUL2016:13:00:00 86.1

03JUL2016:01:00:00 86.1

03JUL2016:13:00:00 84.7

04JUL2016:01:00:00 88.0

04JUL2016:13:00:00 89.7

05JUL2016:01:00:00 89.4

05JUL2016:13:00:00 89.6

06JUL2016:01:00:00 86.8

06JUL2016:01:01:00 .

06JUL2016:13:00:00 87.8

07JUL2016:01:00:00 89.4

07JUL2016:13:00:00 89.8

08JUL2016:01:00:00 89.9

08JUL2016:13:00:00 89.8

09JUL2016:01:00:00 90.0

;;;;

data WORK.BED_DATETIME;

infile datalines dsd truncover;

input Bed:4. Start:DATETIME20. End:DATETIME20.;

format Bed 4. Start DATETIME20. End DATETIME20.;

label Bed="Bed";

datalines;

1668 26JUN2016:02:22:00 02JUL2016:03:29:59

1669 02JUL2016:03:30:00 08JUL2016:19:55:59

1670 08JUL2016:19:56:00 15JUL2016:20:18:59

1671 15JUL2016:20:19:00 22JUL2016:17:14:59

1672 22JUL2016:17:15:00 29JUL2016:14:26:59

1673 29JUL2016:14:27:00 05AUG2016:11:11:59

1674 05AUG2016:11:12:00 12AUG2016:13:41:59

;;;;

mkeintz · Posted 02-26-2021 01:17 AM

I presume you want the weighted average PERCENTGRIND for each bed (from dataset BEDS) where weights are determined by duration of the bed used within each period identified in your PCT_GRIND dataset (that's the name I assign it). I also presume that the PCT_GRIND date/time stamps are the start of a given level of PERCENTGRIND.

Luckily (1) your bed usages do not overlap (each END precedes the next START in the dataset (which I call BEDS), and (2) both datasets are sorted chronologically (BEDS is sorted by START, and PCT_GRIND is sorted by DATETIME). These features are assumed to this program.

Getting a weighted average is straightforward using PROC SUMMARY (also know as PROC MEANS). The problem is generating a dataset with the weights (variable DURATION in data set NEED):

data need;
  set beds ;

  /* Exhaust PCT_GRIND periods that cover this START-END period */
  do until (datetime>end);
    if (datetime<=start<=nxt_dt)  or (datetime<=end<=nxt_dt) then do;
      duration =  1 + min(end,nxt_dt) - max(start,datetime);
      output;
    end;
    if nxt_dt>end then leave;  ** Read no more PCT_GRIND for this bed**;
    merge pct_grind
          pct_grind (firstobs=2 keep=datetime rename=(datetime=nxt_dt));
    format duration time10.0;
  end;
run;

proc summary data=need noprint;
  class bed;
  var percentgrind;  
  weight duration;   
  ways 1;
  output out=want mean=pctgrind_mean sumwgt=total_duration / autoname;
  format total_duration time10.0;
run;

For each bed, dataset NEED gets as many records as there are overlaps between the bed usage (START to END) with the time spans in PCT_GRIND (DATETIME to NXT_DT). Each such record gets a variable called DURATION. That's where the

duration =  1 + min(end,nxt_dt) - max(start,datetime);

comes in. Note the NXT_DT is the datetime starting the next PCT_GRIND span. It is obtained by use of the "FIRSTOBS=2" parameter in the self-merge activated by the MERGE statement.

Since PERCENTGRIND is missing for some time spans, any bed use that includes that time span will include only the neighboring non-missing spans that overlaps with bed usage.

Also any bed usage that is outside of the range of times in PCT_GRIND will not be processed at all.

I recommend you examine dataset NEED to see the intent of the data step programming.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ZacLopresti · Posted 03-01-2021 09:46 PM

Thanks for your reply.

I want all samples to be equally weight though (regardless of time).

I'll have a play with what you've sent and see if I can modify it.

mkeintz · Posted 03-01-2021 10:41 PM

@ZacLopresti wrote:

Thanks for your reply.

I want all samples to be equally weight though (regardless of time).

Then drop the WEIGHT DURATION statement in the proc summary. And if you are never going to examine the exposure time to each PERCENTGRIND, you can also drop the calculation of DURATION in the DATA NEED step.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ZacLopresti · Posted 03-02-2021 07:27 PM

Worked well.
Thankyou.

PGStats · Posted 03-01-2021 10:50 PM

For equal weighting, it is quite simple, as long as bed periods don't overlap:

proc sql;
create table want as
select 
    bed,
    start, end,
    mean(PercentGrind) as averageGrind
from 
    BED_DATETIME as a left join
    BATTERYFEEDSAMPLE as b on b.datetime between start and end
group by bed, start, end;
quit;

PG

ZacLopresti · Posted 03-02-2021 07:28 PM

Fantastic. Thanks

Average sample data if it lays between 2 dates

Re: Average sample data if it lays between 2 dates

Re: Average sample data if it lays between 2 dates

Re: Average sample data if it lays between 2 dates

Re: Average sample data if it lays between 2 dates

Re: Average sample data if it lays between 2 dates

Re: Average sample data if it lays between 2 dates

Re: Average sample data if it lays between 2 dates

Re: Average sample data if it lays between 2 dates

Re: Average sample data if it lays between 2 dates

Re: Average sample data if it lays between 2 dates

Registration is open

SAS Training: Just a Click Away