Hi ALL
ISSUE 1
data have;
input gids $ group (start_time end_time) (:datetime19.);
format start_time end_time e8601dt19.;
datalines;
1-10 1 27FEB22:12:32:00 27FEB22:13:04:00
1-9 1 27FEB22:12:32:00 27FEB22:13:04:00
2-15 1 27FEB22:12:33:00 27FEB22:13:12:00
1-4 2 27FEB22:10:00:00 27FEB22:10:51:00
1-5 2 27FEB22:10:00:00 27FEB22:10:51:00
1-5 3 27FEB22:10:27:00 27FEB22:11:39:00
1-6 3 27FEB22:10:27:00 27FEB22:11:39:00
1-6 4 25FEB22:09:18:00 25FEB22:09:33:00
1-7 4 25FEB22:09:18:00 25FEB22:09:33:00
1-7 5 27FEB22:11:33:00 27FEB22:12:39:00
1-8 5 27FEB22:11:33:00 27FEB22:12:39:00
1-8 6 27FEB22:11:34:00 27FEB22:12:39:00
1-9 6 27FEB22:11:34:00 27FEB22:12:39:00
;
run;
We have the data for multiple days previously we tried to within a single day of data.
While grouping for multiple days if there is any group that is falling on multiple days like
As per the Screenshot below
Group 3 and group 4 should not match even though the gids match but they fall on different dates and there is no continuity in the time frame.
Note: when we working with multiple dates we are facing this issue. If a group starts from 25th Feb 11:55 PM to 26th Feb 00:35 AM and another group starts at 00:10 AM on 26th Feb and their gids match then it can be a single group.
ISSUE 2:
While hashing, single gid 103-19 comes under 2 different newgroup. Actually, new group 2 will not be generated for group4 which contain gids 103-17 and 103-19.
If any of the gids in the group which are already present in newgroup, then that group should go to that newgroup only.
data have;
input gids $ group (start_time end_time) (:datetime19.);
format start_time end_time e8601dt19.;
datalines;
103-21 1 05APR22:12:26:00 05APR22:13:02:00
103-22 1 05APR22:12:26:00 05APR22:13:02:00
102-58 2 05APR22:12:43:00 05APR22:13:06:00
103-22 2 05APR22:12:43:00 05APR22:13:06:00
103-17 4 05APR22:11:26:00 05APR22:12:34:00
103-19 4 05APR22:11:26:00 05APR22:12:34:00
103-19 5 05APR22:11:27:00 05APR22:12:52:00
103-21 5 05APR22:11:27:00 05APR22:12:52:00
;
run;
We have used the below hashing query for grouping data and this query needs to be updated for resolving both issues.
data out;
if 0 then set have;
retain newgroup max_newgroup;
if _n_ = 1
then do;
declare hash new ();
new.definekey("gids");
new.definedata("newgroup");
new.definedone();
max_newgroup = 0;
end;
ng = 0;
do until (last.group);
set have;
by group;
if new.find() = 0 then ng = newgroup;
end;
if not ng
then do;
max_newgroup + 1;
ng = max_newgroup;
end;
newgroup = ng;
do until (last.group);
set have;
by group;
if new.check() ne 0
then rc = new.add();
output;
end;
drop max_newgroup rc ng;
run;
Thanks in advance 🙂
I cannot tell what the question is. What is it that you are trying to accomplish?
It appears to me that:
Is that correct?
Editted note:
I forgot to add. What if a group is not completely contained in the time span of a prior group, but simply intersects it?
HI @mkeintz
Clarification for 4th point and edited notes
If the start_time or end_time of a group falls between newgroup's start_time and end_time and gid's of the group also match newgroup then it should be included with the newgroup. Otherwise, it should create a new newsgroup.
All other points are the same. and please help with this.
thank you
To be honest, I simply don't understand how a matching GID is supposed to influence the newgroup assignment. The code below uses nothing but a comparison of the current groups minimum start_time and maximum end_time against prior newgroup start_times and end_times looking for any overlap.
It matches your expected newgroup assignments.
One caveat: when I assign the current group to a prior newgroup based on overlap of start or end time, I do NOT update the start and end time values initially determined for that prior newgroup. Is that what you want?
In other words, if
data have;
input gids $ group (start_time end_time) (:datetime19.) EXPECTED;
format start_time end_time e8601dt19.;
datalines;
1-10 1 27FEB22:12:32:00 27FEB22:13:04:00 1
1-9 1 27FEB22:12:32:00 27FEB22:13:04:00 1
2-15 1 27FEB22:12:33:00 27FEB22:13:12:00 1
1-4 2 27FEB22:10:00:00 27FEB22:10:51:00 2
1-5 2 27FEB22:10:00:00 27FEB22:10:51:00 2
1-5 3 27FEB22:10:27:00 27FEB22:11:39:00 2
1-6 3 27FEB22:10:27:00 27FEB22:11:39:00 2
1-6 4 25FEB22:09:18:00 25FEB22:09:33:00 3
1-7 4 25FEB22:09:18:00 25FEB22:09:33:00 3
1-7 5 27FEB22:11:33:00 27FEB22:12:39:00 1
1-8 5 27FEB22:11:33:00 27FEB22:12:39:00 1
1-8 6 27FEB22:11:34:00 27FEB22:12:39:00 1
1-9 6 27FEB22:11:34:00 27FEB22:12:39:00 1
;
run;
data want (drop=_: i);
if _n_=1 then do;
if 0 then set have;
call missing(newgroup,_newgrp_beg,_newgrp_end);
declare hash ngrp (ordered:'a');
ngrp.definekey('_newgrp_beg');
ngrp.definedata('_newgrp_beg','_newgrp_end','newgroup');
ngrp.definedone();
declare hiter gi ('ngrp');
end;
** Read a group, determine its date range **;
do i=1 by 1 until (last.group);
set have;
by group;
_current_beg=min(_current_beg,start_time);
_current_end=max(_current_end,end_time);
end;
/** Check the current date ranges against proir newgroup ranges**/
do _rc=gi.first() by 0 while(_rc=0);
if _current_beg>_newgrp_end or _current_end<_newgrp_beg then newgroup=.;
if newgroup^=. then leave;
_rc=gi.next();
end;
if newgroup=. then do; /*If no newgroup match found, then this is a new one */
newgroup=ngrp.num_items+1;
ngrp.add(key:_current_beg,data:_current_beg,data:_current_end,data:newgroup);
end;
/* Now that a newgroup has been assigned, re-read and output this group */
do until (last.group);
set have;
by group;
output;
end;
call missing(of _all_);
run;
Hi @mkeintz
QUESTION: To be honest, I simply don't understand how a matching GID is supposed to influence the newgroup assignment.
ANSWER: GID is the main column to form a new group. for example, if the first group is assigned as a newgroup 1 then it will check GIDs in the GROUP 2 against the new group1. if GIDs matches are found and also time overlaps in newgroup 1 and group2, then group 2 will be associated with newgroup 1. likewise it should check all the groups and form new groups.
QUESTION: It matches your expected newgroup assignments.
ANSWER: yes it is correct for the given sample input. But we are facing discrepancies when testing with large data. it's forming newgroup only based on time overlap only and not with GID. having large data like 4M records so that time its falling same timespan. finally it ended with newgroup 1 only.
we have to consider both GID matches and Time over lapping. if both condition are not met then newgroup should be created.
With enough source data, your conditions will automatically lead to only one group. Just think about it.
@Sastech wrote:
Hi @mkeintz
QUESTION: To be honest, I simply don't understand how a matching GID is supposed to influence the newgroup assignment.
ANSWER: GID is the main column to form a new group. for example, if the first group is assigned as a newgroup 1 then it will check GIDs in the GROUP 2 against the new group1. if GIDs matches are found and also time overlaps in newgroup 1 and group2, then group 2 will be associated with newgroup 1. likewise it should check all the groups and form new groups.
OK, I have a clear understanding that if
then group X is assigned the same newgroup (i.e. 1) as was group T.
So here are some unanswered questions.
I clearly do not have a comprehensive understanding of what the assignment rules are for this problem.
HI @mkeintz
Question :
Answer: Yes, once any group which will be added to the new group based on gid matches and time overlap then newgroup's
time range will be expanded based on the inclusion of the group which is added.
2. By that I mean, what if group Z overlaps with group X (but not group T) and has some other GID in common with X (but not T)? Does group Z then get assigned to newgroup 1?
Answer:
First, it will assign group X (1) as newgroup 1 and it has ids-103-21 and 103-22 and start_time as 05-04-2022 12:26 &end_time as 05-04-2022 13:02
Next it should check any gids which are present in the newgroup with any other group. So here, condition will satisfy with group Z (4) and time window of group Z (4) also overlaps with newgroup.
Now newgroup has ids (103-21,103-22,103-19,103-21) and newgroups time window gets expanded as start_time as start_time as 05-04-2022 11:27 & end_time as 05-04-2022 13:02.
Next, It will continue with other group which has matching id then it will check group T (2) which has 103-22 and time overlap condition satisfies so it will be added to newgroup 1.
Now Newgroup has ids (103-21,103-22,103-19,103-21,102-58,103-22) and newgroups time window gets expanded as start_time as start_time as 05-04-2022 11:27 &end_time as 05-04-2022 13:06.
Next, It will continue with other group which has matching id then it will check group U (3) which has 103-19 and time overlap condition satisfies so it will be added to newgroup 1.
Now Newgroup has ids (103-21,103-22,103-19,103-21,102-58,103-22,103-17,103-19) and newgroups time window gets expanded as start_time as start_time as 05-04-2022 11:26 &end_time as 05-04-2022 13:06.
3.And what do you do if group X overlaps with group T but has no common GID? What newgroup value is assigned to group X?
Answer: No group will be added with new group until both the conditions (ids in group matches with new group and time overlap with new group) are met.
If any of this conditions are not met then the group will be assigned with newgroup.
Thank You
I don't see how a HASH object can help with this problem. The keys to a hash object need to be exact matches. But your problem seems to need to test whether datetime intervals are overlapping or not.
Can you describe what this data represents? Perhaps understanding the real world case will make it clearer what you are trying to do.
Looking at the data you posted it looks like the start/stop times are for the GROUP not the individual rows.
You have IDs that are mapped to multiple groups.
You have time windows for some groups that overlap.
So is the goal to reduce the number of groups by combining groups where the time overlaps?
Or combining groups where the membership overlaps?
Or both?
Let's eliminate some of the complexity of by mapping the datetime values to simple integers that reflect the relative order. Also let's simplify the ID values so they are just simple letters rather the the hyphenated strings. So if we sort by ID and GROUP we get this input data.
data have;
input id $ group start end ;
cards;
A 2 3 5
B 2 3 5
B 3 4 8
C 3 4 8
C 4 1 2
D 4 1 2
D 5 6 11
E 5 6 11
E 6 7 11
F 1 9 12
F 6 7 11
G 1 9 12
H 1 10 13
;
If you want to combine based on the existing groups then there is only one group here. Start with group 1 That merges with 6 because ID=F which merges with 5 because of ID=E which merges with 4 because of ID=D which merges with 3 because of ID=C and merges with 2 because ID=B.
If you want to combine based on times then there are two groups. The original group 4 which went between the two smallest time points will be a group by itself since it does not overlap any others. All of the other groups overlap in some way so will become the second new group.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.