Re: Forming new group based on the sub group

Sastech · Posted 04-21-2022 12:55 PM

Hi All

data have;
input id $ group (start_time end_time) (:datetime19.);
format start_time end_time e8601dt19.;
datalines;
1-10 1 27FEB22:12:32:00 27FEB22:13:04:00
1-9 1 27FEB22:12:32:00 27FEB22:13:04:00
2-15 1 27FEB22:12:33:00 27FEB22:13:12:00
1-4 2 27FEB22:10:00:00 27FEB22:10:51:00
1-5 2 27FEB22:10:00:00 27FEB22:10:51:00
1-5 3 27FEB22:10:27:00 27FEB22:11:39:00
1-6 3 27FEB22:10:27:00 27FEB22:11:39:00
1-6 4 25FEB22:09:18:00 25FEB22:09:33:00
1-7 4 25FEB22:09:18:00 25FEB22:09:33:00
1-7 5 27FEB22:11:33:00 27FEB22:12:39:00
1-8 5 27FEB22:11:33:00 27FEB22:12:39:00
1-8 6 27FEB22:11:34:00 27FEB22:12:39:00
1-9 6 27FEB22:11:34:00 27FEB22:12:39:00
103-21 7 05APR22:12:26:00 05APR22:13:02:00
103-22 7 05APR22:12:26:00 05APR22:13:02:00
102-58 8 05APR22:12:43:00 05APR22:13:06:00
103-22 8 05APR22:12:43:00 05APR22:13:06:00
103-17 9 05APR22:11:26:00 05APR22:12:34:00
103-19 9 05APR22:11:26:00 05APR22:12:34:00
103-19 10 05APR22:11:27:00 05APR22:12:52:00
103-21 10 05APR22:11:27:00 05APR22:12:52:00
;
run;

Expected Output :

1. We are forming the variable new_group based on a time span covered by all members of a group.
(time span is defined as the earliest start and latest end of all the observations in a group)
2. all members of a group will be assigned to the same new_group.
3. The first group will be assigned newgroup=1.
4. If the earliest start_time or last end_time of a group falls in between already formed newgroup's earliest start_time and last end_time and also id's of the group also match newgroup then it should be included with the newgroup. Otherwise, it should create a new newsgroup.

Note:
1.Eventhough group 2 and 3 have same ids , they should not come under same new_group because both groups are having different time and they did not intersect/overlap.
2. Single id value should not come under multiple new_groups.

PaigeMiller · Posted 04-21-2022 01:23 PM

I'm not sure I grasp the method of creating this new group variable.

The first 3 rows have new_group=1, based on what logic? Please explain.

The next 4 rows have new_group=2, based on what logic? Please explain.

--
Paige Miller

Sastech · Posted 04-21-2022 01:55 PM

The first 3 rows have new_group=1 because group 1 of ids does not match with the any other group of ids so its starts with new_group 1

The next 4 rows have new_group=2 beacuse from group 2 and 3 the id 1-5 is common in group 2 and 3 so its coming under the new_group 2.
From group 3 and 4 the id 1-6 is common in the both the group but time span is different so its not joining into same new_group 2 and its new_group 3
From group 4 and 5 the id 1-7 is common in the both the group but time span is different so its not joining into same new_group 3 and its new_group 4

PaigeMiller · Posted 04-21-2022 01:57 PM

So we only have to look at the next group, we would never have to look at the group that is five groups away?

--
Paige Miller

Sastech · Posted 04-21-2022 02:01 PM

we have to look next common ids from one group to other group. if we see group 7 and the id 103 -21 is match with group 10 and coming under the same newgroup 5

Sastech · Posted 04-22-2022 07:17 AM

Hi @PaigeMiller
except timespan everything is working as expected
how to alter the the code for across the days.

data out;
if 0 then set innerjoin_table_new1_t;
retain newgroup max_newgroup;
if _n_ = 1
then do;
declare hash new ();
new.definekey("gtmds_uid");
new.definedata("newgroup");
new.definedone();
max_newgroup = 0;
end;
ng = 0;
do until (last.group);
set innerjoin_table_new1_t;
by group;
if new.find() = 0 then ng = newgroup;
end;
if not ng
then do;
max_newgroup + 1;
ng = max_newgroup;
end;
newgroup = ng;
do until (last.group);
set innerjoin_table_new1_t;
by group;
if new.check() ne 0
then rc = new.add();
output;
end;
drop max_newgroup rc ng;
run;

Kurt_Bremser · Posted 04-22-2022 08:19 AM

Your code uses gtmds_uid as key of the hash, but this variable does not exist in the example dataset you posted.

Does id play any role in the building of new groups, or is it only the date that defines a group?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Sastech · Posted 04-22-2022 08:30 AM

Hi @Kurt_Bremser
Updated the code shared below.
1.Eventhough group 2 and 3 have same ids , they should not come under same new_group because both groups are having different time and they did not intersect/overlap.
2. Single id value should not come under multiple new_groups.

data out;
if 0 then set have;
retain newgroup max_newgroup;
if _n_ = 1
then do;
declare hash new ();
new.definekey("id");
new.definedata("newgroup");
new.definedone();
max_newgroup = 0;
end;
ng = 0;
do until (last.group);
set innerjoin_table_new1_t;
by group;
if new.find() = 0 then ng = newgroup;
end;
if not ng
then do;
max_newgroup + 1;
ng = max_newgroup;
end;
newgroup = ng;
do until (last.group);
set innerjoin_table_new1_t;
by group;
if new.check() ne 0
then rc = new.add();
output;
end;
drop max_newgroup rc ng;
run;

Kurt_Bremser · Posted 04-22-2022 08:34 AM

You perfectly contradict yourself. In 1) you say that same id's should belong to different newgroups (under certain conditions), but in 2) you say that a single id should always be in the same newgroup.

Please make up your mind.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Sastech · Posted 04-23-2022 02:17 PM

Hi @Kurt_Bremser

1-6 and 1- 7 (Group 4) need to be set to be with a separate newgroup as this window 25-02-2022 09:18:00 to 25-02-2022 09:33:00 happens at a separate time window which is not overlapping with any other newgroups time window.

Note: 1. If the time window of the group intersects/overlaps with newgroup and ids match, then the group will be added with the same newgroup. 2. If ids are not matching or time windows are not overlapping, then that group will be generated with newgroup number

SASTECH1 · Posted 04-21-2022 12:47 PM

Hi All,

The data needs to be grouped based on conditions.

data have;
input id $ group (start_time end_time) (:datetime19.);
format start_time end_time e8601dt19.;
datalines;
1-10 1 27FEB22:12:32:00 27FEB22:13:04:00
1-9 1 27FEB22:12:32:00 27FEB22:13:04:00
2-15 1 27FEB22:12:33:00 27FEB22:13:12:00
1-4 2 27FEB22:10:00:00 27FEB22:10:51:00
1-5 2 27FEB22:10:00:00 27FEB22:10:51:00
1-5 3 27FEB22:10:27:00 27FEB22:11:39:00
1-6 3 27FEB22:10:27:00 27FEB22:11:39:00
1-6 4 25FEB22:09:18:00 25FEB22:09:33:00
1-7 4 25FEB22:09:18:00 25FEB22:09:33:00
1-7 5 27FEB22:11:33:00 27FEB22:12:39:00
1-8 5 27FEB22:11:33:00 27FEB22:12:39:00
1-8 6 27FEB22:11:34:00 27FEB22:12:39:00
1-9 6 27FEB22:11:34:00 27FEB22:12:39:00
103-21 7 05APR22:12:26:00 05APR22:13:02:00
103-22 7 05APR22:12:26:00 05APR22:13:02:00
102-58 8 05APR22:12:43:00 05APR22:13:06:00
103-22 8 05APR22:12:43:00 05APR22:13:06:00
103-17 9 05APR22:11:26:00 05APR22:12:34:00
103-19 9 05APR22:11:26:00 05APR22:12:34:00
103-19 10 05APR22:11:27:00 05APR22:12:52:00
103-21 10 05APR22:11:27:00 05APR22:12:52:00
;
run;

Expected Output:

Grouping needs to be done based on below conditions.

1. We are forming the variable new_group based on a time span covered by all members of a group.
(time span is defined as the earliest start and latest end of all the observations in a group)
2. all members of a group will be assigned to the same new_group.
3. The first group will be assigned newgroup=1.
4. If the earliest start_time or last end_time of a group falls in between already formed newgroup's earliest start_time and last end_time and also id's of the group also match newgroup then it should be included with the newgroup. Otherwise, it should create a new newsgroup.

Note:
1.Eventhough group 2 and 3 have same ids , they should not come under same new_group because both groups are having different time and they did not intersect/overlap.
2. Single id value should not come under multiple new_groups.

Please help on this.

Thank you.

ChrisNZ · Posted 04-22-2022 02:09 AM

Why is GROUP 5 not in NEWGROUP 2 since it overlaps?

High-Performance SAS Coding - Third Edition

SASTECH1 · Posted 04-22-2022 04:12 AM

The ids in group 5 (1-7 and 1-8) are not in Newgroup -2. So only it is not in newgroup2 though time window overlap.

New groups should be assigned based on both conditions checking ids present in it and also time overlap.

Thank you

@ChrisNZ

andreas_lds · Posted 04-22-2022 03:05 AM

Please don't double post questions.

average_joe · Posted 04-25-2022 05:22 PM

I'm not sure about the interaction between IDs and formation of new groups because some of the info shared in this post seems inconsistent or contradictory. However, if the current group's range of time overlaps the new group's range of time, you can get the desired outcome.

* determine each group's min start and max end;
proc summary data=have nway;
class group;
output out=grp_range (drop=_:) min(start_time)=min_grp_start max(end_time)=max_grp_end;
run;

* determine new groups based only on overlaps of time;
data new_groups;
set grp_range;
retain new_group 0 min_new_grp_start max_new_grp_end;
format min_new_grp_start max_new_grp_end e8601dt19.;

* test if current group overlaps the current new_group;
* if not, then create a new group and reset the new group dates;
if not (min_grp_start <= max_new_grp_end and max_grp_end >= min_new_grp_start) then do;
   new_group + 1;
   min_new_grp_start = min_grp_start;
   max_new_grp_end = max_grp_end;
end;

* dynamically calc new group min and max;
min_new_grp_start = min(min_new_grp_start, min_grp_start);
max_new_grp_end = max(max_new_grp_end, max_grp_end);
run;

Of course, a much larger sample is needed to fully vet this logic.

Forming new group based on the sub group