Hello,
I have data that includes a time variable and weather descriptions from rater1 and rater2, e.g.,
data have;
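infile cards truncover; /* without this, the blank rater2 on the last row makes list input read past the line */
input time : time5. rater1 : $ rater2 : $ ;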
format time time5. ;
cards;
1:01 RA DZ
1:02 RA DZ
1:03 RA DZ
2:06 DZ PL
2:07 DZ PL
2:15 PL
;
run;
These sequences can run quite long (not just three minutes, but maybe three hours), and there are many of them. What I'd like is a summary of each, something like:
data want;
infile cards truncover; /* the last row has no rater2 value */
input startTime : time5. duration rater1 : $ rater2 : $ ;
format startTime time5. ;
cards;
1:01 3 RA DZ
2:06 2 DZ PL
2:15 1 PL
;
run;
rater1 and rater2 values include RA, DZ, PL, ' ', SN, RAPL, SNPL, and RASN. Any suggestions?
Thanks, Bruce
Bruce,
Your post leaves a number of points open to interpretation. Perhaps you could narrow down the problem by addressing a few of these.
What is the definition of a sequence?
If enough time passes, but the raters stay the same, would that begin a new sequence?
Within the same sequence could the raters switch positions (so that rater #1 becomes rater #2 and vice versa)?
Is duration a count of records, or does it represent a calculation based on the first and last TIME value?
Can two sequences overlap?
What is the order of the incoming data records?
Three of these questions are really intertwined: the definition of a sequence, overlapping sequences, and the order of the incoming records. They are all ways of looking at how the data identifies a sequence.
The program might be as simple as:
proc summary data=have nway;
class rater1 rater2;
var time;
output out=want (drop=_type_ rename=(_freq_=duration)) min=start_time;
run;
But I feel like I'm guessing at what needs to be done.
Hi,
thanks for looking into my question. To answer your questions:
What is the definition of a sequence? A sequence is a contiguous block of time (e.g., 1:01, 1:02, 1:03) with the same value for rater1 and the same value for rater2 throughout. If the time skips a minute or either rater's value changes, the sequence ends. Typically, a new sequence begins when the time skips a value. I'd be satisfied with a solution based on skipped time alone, but code that watches for both skipped time and changes in a rater's value would be very nice.
If enough time passes, but the raters stay the same, would that begin a new sequence? It depends: any gap larger than one minute breaks the sequence. If the time does not skip any minutes, the sequence continues.
Within the same sequence could the raters switch positions (so that rater #1 becomes rater #2 and vice versa)? No. If either rater changes their "report," e.g., one switches from PL to DZ, this begins a new sequence.
Is duration a count of records, or does it represent a calculation based on the first and last TIME value? Duration is last time - first time + 1 minute.
Can two sequences overlap? No, time is monotonically increasing (always increasing).
What is the order of the incoming data records? Sorted by time.
Your proc summary worked on my toy set. But if I add a new RA DZ pair at 1:08, proc summary no longer returns the correct result, because this new pair should start a new sequence.
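For reference, here is a sketch of the extended toy data described above. The data set name have_gap is made up for illustration; the 1:08 row is placed after 1:03 because the records are sorted by time.

```sas
/* Toy data from the question plus one extra RA DZ row at 1:08. */
/* have_gap is a hypothetical name used only for this example.  */
data have_gap;
infile cards truncover; /* the last row has no rater2 value */
input time : time5. rater1 : $ rater2 : $ ;
format time time5. ;
cards;
1:01 RA DZ
1:02 RA DZ
1:03 RA DZ
1:08 RA DZ
2:06 DZ PL
2:07 DZ PL
2:15 PL
;
run;
```

Under the definition given above, this data should produce four sequences: 1:01 (3 minutes), 1:08 (1 minute), 2:06 (2 minutes), and 2:15 (1 minute). Grouping on the rater values alone folds the 1:08 row into the 1:01 block.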
Thanks very much, Bruce
Aha! Thanks for the answers.
I suggest adding a variable to number the sequences.
data with_sequence;
set have;
by rater1 rater2 notsorted;
time_dif = dif(time);
if first.rater2 then sequence + 1;
else if time_dif > 60 then sequence + 1;
run;
You can decide if the cutoff point of > 60 needs to be adjusted or not.
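If the cutoff is likely to change, one option is to hold it in a macro variable. This is only a sketch: the name max_gap and the output data set name with_sequence_gap are made up for illustration, and it assumes the HAVE data set from the question, where TIME is stored in seconds.

```sas
%let max_gap = 60; /* hypothetical: largest allowed gap between rows, in seconds */

data with_sequence_gap;
set have;
by rater1 rater2 notsorted;
time_dif = dif(time); /* seconds since the previous row; missing on the first row */
/* start a new sequence when either rater changes or the gap exceeds the cutoff */
if first.rater2 or time_dif > &max_gap then sequence + 1;
run;
```

Because FIRST.RATER2 is also set whenever RATER1 changes, the single condition covers a change in either rater.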
Then summarize:
proc summary data=with_sequence;
by sequence rater1 rater2 notsorted;
var time;
output out=summarized (drop=_type_ _freq_) min=time_begins max=time_ends;
run;
You would still need to read the summary back in, to compute duration. Something along these lines:
data want;
set summarized; /* time_begins and time_ends exist only in the PROC SUMMARY output */
duration = time_ends - time_begins + 60;
run;
I expect that the TIME-based statistics are measured in seconds, so you need to add 60. But if you test this and find that's not the case, you can always adjust the formula.
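The question asks for duration in minutes, while SAS time values are stored in seconds. Here is a minimal sketch of that conversion. The data set summarized_demo is a hypothetical stand-in holding the begin and end times of the three toy sequences, so the step can be run on its own.

```sas
/* Hypothetical stand-in for the summarized data: one row per sequence. */
data summarized_demo;
input time_begins : time5. time_ends : time5. ;
format time_begins time_ends time5. ;
cards;
1:01 1:03
2:06 2:07
2:15 2:15
;
run;

data want_minutes;
set summarized_demo;
/* elapsed seconds converted to minutes, plus 1 so a single-row sequence counts as 1 */
duration = (time_ends - time_begins) / 60 + 1;
run;
```

For these three rows, duration comes out as 3, 2, and 1, matching the output requested in the question.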
Good luck.
Oops! Added NOTSORTED a second time.