
start time and duration of a sequence


hello,

I have data that includes a time variable and weather descriptions from rater1 and rater2, e.g.,

data have;
infile cards missover; /* the last record carries only one rater value */
input time : time5. rater1 : $ rater2 : $ ;
format  time time5.  ;
cards; 
1:01       RA DZ
1:02       RA DZ
1:03       RA DZ
2:06       DZ PL
2:07       DZ PL
2:15       PL
;
run;

These sequences can run quite long (not just three minutes, but perhaps three hours), and there are many of them. What I'd like is a summary of each, something like:

data want;
infile cards missover; /* the last record carries only one rater value */
input startTime : time5. duration rater1 : $ rater2 : $ ;
format  startTime time5.  ;
cards; 
1:01   3    RA DZ
2:06   2    DZ PL
2:15   1    PL
;
run;

rater1 and rater2 values include RA, DZ, PL, ' ', SN, RAPL, SNPL, and RASN. Any suggestions?

 

Thanks, Bruce



All Replies
Super User
Posts: 5,086

Re: start time and duration of a sequence

Bruce,

 

Your post leaves a number of points open to interpretation.  Perhaps you could narrow down the problem by addressing a few of these.

 

What is the definition of a sequence? 

 

If enough time passes, but the raters stay the same, would that begin a new sequence?

 

Within the same sequence could the raters switch positions (so that rater #1 becomes rater #2 and vice versa)?

 

Is duration a count of records, or does it represent a calculation based on the first and last TIME value?

 

Can two sequences overlap?

 

What is the order of the incoming data records? 

 

Three of these questions are really intertwined:  the definition of a sequence, overlapping sequences, and the order of the incoming records.  They are all ways of looking at how the data identifies a sequence.

 

The program might be as simple as:

 

proc summary data=have nway;

class rater1 rater2;

var time;

output out=want (drop=_type_ rename=(_freq_=duration)) min=start_time;

run;

 

But I feel like I'm guessing at what needs to be done.

Contributor
Posts: 37

Re: start time and duration of a sequence

Hi,

 

thanks for looking into my question. To answer your questions:

 

What is the definition of a sequence? A sequence comprises a contiguous block of time, e.g. 1:01 1:02 1:03, with the same value for rater1 and the same value for rater2 throughout. If the time skips a minute or a rater's value changes, the current sequence ends. Typically, a new sequence begins when the time skips a value. I'd be satisfied with a solution that detects sequences based on this skipping alone, but code that watches for both skipped time and changes in a rater's value would be very nice.

 

If enough time passes, but the raters stay the same, would that begin a new sequence? It depends: any gap larger than a minute breaks the sequence. If the time does not skip any minutes, the sequence continues.

 

Within the same sequence could the raters switch positions (so that rater #1 becomes rater #2 and vice versa)? No. If either rater changes their "report," e.g., one switches from PL to DZ, this begins a new sequence. 

 

Is duration a count of records, or does it represent a calculation based on the first and last TIME value? Duration is last time - first time + 1 minute.

 

Can two sequences overlap? No, time is strictly increasing.

 

What is the order of the incoming data records? Sorted by time.

 

Three of these questions are really intertwined:  the definition of a sequence, overlapping sequences, the order to the incoming records.  They are all ways of looking at how the data identifies a sequence.

 

Your proc summary worked on my toy set. But if I add a new RA DZ pair at 1:08, proc summary no longer returns the correct result, because this new pair should start a new sequence.

 

Thanks very much, Bruce
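
The failure with the extra 1:08 reading described above can be reproduced with a small sketch. It assumes the toy data from the first post plus one added RA/DZ record; the data set names have_gap and pooled and the variable n_records are made up for illustration.

```sas
/* Toy data from the first post plus an extra RA/DZ reading at 1:08 */
data have_gap;
  input time : time5. rater1 $ rater2 $;
  format time time5.;
  datalines;
1:01 RA DZ
1:02 RA DZ
1:03 RA DZ
1:08 RA DZ
2:06 DZ PL
2:07 DZ PL
;
run;

/* CLASS groups by value only, ignoring time gaps, so all four */
/* RA/DZ records are pooled into a single group.               */
proc summary data=have_gap nway;
  class rater1 rater2;
  var time;
  output out=pooled (drop=_type_ rename=(_freq_=n_records)) min=start_time;
run;

proc print data=pooled;
  format start_time time5.;
run;
```

The RA/DZ row comes back with n_records=4 and a start time of 1:01, so the isolated 1:08 reading is absorbed into the earlier block instead of forming its own sequence.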

Solution
‎03-01-2016 08:59 AM
Super User
Posts: 5,086

Re: start time and duration of a sequence

[ Edited ]

Aha!  Thanks for the answers.

 

I suggest adding a variable to number the sequences.

 

data with_sequence;

set have;

by rater1 rater2 notsorted;

time_dif = dif(time);

if first.rater2 then sequence + 1;

else if time_dif > 60 then sequence + 1;

run;

 

You can decide if the cutoff point of > 60 needs to be adjusted or not.
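
As a quick check of that cutoff, here is a minimal sketch. It assumes TIME holds an ordinary SAS time value (seconds since midnight), which is what the TIME5. informat in the original post produces; the variable names t1, t2 and gap are only for illustration.

```sas
/* SAS time values count seconds since midnight, so two readings */
/* taken one minute apart differ by 60.                          */
data _null_;
  t1 = '1:01't;   /* 1 hour 1 minute */
  t2 = '1:02't;   /* one minute later */
  gap = t2 - t1;
  put t1= t2= gap=;
run;
```

The log should show gap=60, so a test of time_dif > 60 treats consecutive minutes as contiguous and anything further apart as a break.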

 

Then summarize:

 

proc summary data=with_sequence;

by sequence rater1 rater2 notsorted;

var time;

output out=summarized (drop=_type_ _freq_) min=time_begins max=time_ends;

run;

 

You would still need to read the summary back in, to compute duration.  Something along these lines:

 

data want;

set summarized;

duration = time_ends - time_begins + 60;

run;

 

I expect that the TIME-based statistics are measured in seconds, so you need to add 60 (one minute).  But if you test this and find that's not the case, you can always adjust the formula.

 

Good luck.

 

Oops!  Added NOTSORTED a second time.

 

Contributor
Posts: 37

Re: start time and duration of a sequence

Very nice! Thanks! Note that the data want step near the bottom has to read the proc summary output ("set summarized"). Also, keeping _freq_ in the proc summary output gives the duration in minutes without needing the additional data step. Thanks a lot for the hand. I often have data with skips in the time and have struggled with numbering contiguous sequences. Now I know how.
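
Putting the pieces of the thread together, the whole approach might look like the sketch below. It is an illustration, not the poster's exact program: the data set have_gap (the toy data plus the 1:08 reading) and the variable n_records are assumed names, and the duration is divided by 60 on the assumption that TIME is stored in seconds and the result is wanted in minutes.

```sas
data have_gap;
  input time : time5. rater1 $ rater2 $;
  format time time5.;
  datalines;
1:01 RA DZ
1:02 RA DZ
1:03 RA DZ
1:08 RA DZ
2:06 DZ PL
2:07 DZ PL
;
run;

/* Start a new sequence when either rater changes or when more */
/* than 60 seconds separate consecutive readings.              */
data with_sequence;
  set have_gap;
  by rater1 rater2 notsorted;
  time_dif = dif(time);   /* missing on the first record */
  if first.rater2 or time_dif > 60 then sequence + 1;
run;

/* One output row per sequence; _freq_ is kept as a record count */
proc summary data=with_sequence;
  by sequence rater1 rater2 notsorted;
  var time;
  output out=summarized (drop=_type_ rename=(_freq_=n_records))
         min=time_begins max=time_ends;
run;

/* Duration in minutes: last time - first time + 1 minute */
data want;
  set summarized;
  duration = (time_ends - time_begins) / 60 + 1;
  format time_begins time_ends time5.;
run;
```

On this toy data the result has three sequences, with durations of 3, 1 and 2 minutes. Because a gap of more than one minute always starts a new sequence, n_records and duration agree whenever the readings are exactly one minute apart.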