Re: Creating new groups using hashing in SAS

Sastech · Posted 04-20-2022 01:39 PM

Hi ALL

ISSUE 1

data have;
input gids $ group (start_time end_time) (:datetime19.);
format start_time end_time e8601dt19.;
datalines;
1-10 1 27FEB22:12:32:00 27FEB22:13:04:00
1-9 1 27FEB22:12:32:00 27FEB22:13:04:00
2-15 1 27FEB22:12:33:00 27FEB22:13:12:00
1-4 2 27FEB22:10:00:00 27FEB22:10:51:00
1-5 2 27FEB22:10:00:00 27FEB22:10:51:00
1-5 3 27FEB22:10:27:00 27FEB22:11:39:00
1-6 3 27FEB22:10:27:00 27FEB22:11:39:00
1-6 4 25FEB22:09:18:00 25FEB22:09:33:00
1-7 4 25FEB22:09:18:00 25FEB22:09:33:00
1-7 5 27FEB22:11:33:00 27FEB22:12:39:00
1-8 5 27FEB22:11:33:00 27FEB22:12:39:00
1-8 6 27FEB22:11:34:00 27FEB22:12:39:00
1-9 6 27FEB22:11:34:00 27FEB22:12:39:00
;
run;

We have the data for multiple days previously we tried to within a single day of data.

While grouping for multiple days if there is any group that is falling on multiple days like

As per the Screenshot below

Group 3 and group 4 should not match even though the gids match but they fall on different dates and there is no continuity in the time frame.

Note: when we working with multiple dates we are facing this issue. If a group starts from 25th Feb 11:55 PM to 26th Feb 00:35 AM and another group starts at 00:10 AM on 26th Feb and their gids match then it can be a single group.

ISSUE 2:

While hashing, single gid 103-19 comes under 2 different newgroup. Actually, new group 2 will not be generated for group4 which contain gids 103-17 and 103-19.

If any of the gids in the group which are already present in newgroup, then that group should go to that newgroup only.

data have;
input gids $ group (start_time end_time) (:datetime19.);
format start_time end_time e8601dt19.;
datalines;
103-21 1 05APR22:12:26:00 05APR22:13:02:00
103-22 1 05APR22:12:26:00 05APR22:13:02:00
102-58 2 05APR22:12:43:00 05APR22:13:06:00
103-22 2 05APR22:12:43:00 05APR22:13:06:00
103-17 4 05APR22:11:26:00 05APR22:12:34:00
103-19 4 05APR22:11:26:00 05APR22:12:34:00
103-19 5 05APR22:11:27:00 05APR22:12:52:00
103-21 5 05APR22:11:27:00 05APR22:12:52:00
;
run;

We have used the below hashing query for grouping data and this query needs to be updated for resolving both issues.

data out;
if 0 then set have;
retain newgroup max_newgroup;
if _n_ = 1
then do;
declare hash new ();
new.definekey("gids");
new.definedata("newgroup");
new.definedone();
max_newgroup = 0;
end;
ng = 0;
do until (last.group);
set have;
by group;
if new.find() = 0 then ng = newgroup;
end;
if not ng
then do;
max_newgroup + 1;
ng = max_newgroup;
end;
newgroup = ng;
do until (last.group);
set have;
by group;
if new.check() ne 0
then rc = new.add();
output;
end;
drop max_newgroup rc ng;
run;

Thanks in advance 🙂

@Ksharp @Kurt_Bremser @PGStats @art297 @ChrisNZ

Tom · Posted 04-20-2022 02:10 PM

I cannot tell what the question is. What is it that you are trying to accomplish?

Sastech · Posted 04-20-2022 02:48 PM

Hi @Tom

data contains gids group start_time and end_time

need to form a newgroup based on the below condition

If group 1 of gids matches with group 2 then-new group_id should be generated as 1 until the matches break from one group to another group and also it should check new comparing groups start time must fall on previous group start and end time.

it has to check all gids of a group for matching. If any of the gids in the group matches same newgroup should be populated.

while developing I facing above issues

mkeintz · Posted 04-20-2022 03:37 PM

It appears to me that:

you are forming the variable new_group based on a time span covered by all members of a group.
(time span is defined as the earliest start and latest end of all the observations in a group)
all members of a group will be assigned to the same new_group.
The first group will be assigned newgroup=1.
If the time span of a group is contained within the time span of a previous group, then it is assigned the same newgroup value as that previous group.
otherwise it is assigned the next available value for newgroup.
Data are already sorted by group

Is that correct?

Editted note:

I forgot to add. What if a group is not completely contained in the time span of a prior group, but simply intersects it?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Sastech · Posted 04-21-2022 05:04 AM

HI @mkeintz

Clarification for 4th point and edited notes

If the start_time or end_time of a group falls between newgroup's start_time and end_time and gid's of the group also match newgroup then it should be included with the newgroup. Otherwise, it should create a new newsgroup.

All other points are the same. and please help with this.

thank you

mkeintz · Posted 04-22-2022 05:19 PM

To be honest, I simply don't understand how a matching GID is supposed to influence the newgroup assignment. The code below uses nothing but a comparison of the current groups minimum start_time and maximum end_time against prior newgroup start_times and end_times looking for any overlap.

It matches your expected newgroup assignments.

One caveat: when I assign the current group to a prior newgroup based on overlap of start or end time, I do NOT update the start and end time values initially determined for that prior newgroup. Is that what you want?

In other words, if

Group 1 has range 06jan2022:10:20:00 -10jan2022:15:52:00 and is assigned to newgroup=1.
Later on group 11 has a start of 09jan2022:13:00:00 and end of 12jan2022:15:00 then group 11 is assigned to newgroup=1, even though it is not completely internal to the newgroup=1 range

BUT I don't expand the definition for new group 1.
That means if group 12 has a start of 11jan2022:15:00:00, which overlaps with group 11, but not with the original definition for newgroup 1, then group 12 would be not be assigned an existing newgroup value, but would get a new newgroup value.
Is that what you want?


data have;
input gids $ group (start_time end_time) (:datetime19.)  EXPECTED;
format start_time end_time e8601dt19.;
datalines;
1-10 1 27FEB22:12:32:00 27FEB22:13:04:00     1
1-9  1 27FEB22:12:32:00 27FEB22:13:04:00     1
2-15 1 27FEB22:12:33:00 27FEB22:13:12:00     1
1-4  2 27FEB22:10:00:00 27FEB22:10:51:00     2
1-5  2 27FEB22:10:00:00 27FEB22:10:51:00     2
1-5  3 27FEB22:10:27:00 27FEB22:11:39:00     2
1-6  3 27FEB22:10:27:00 27FEB22:11:39:00     2
1-6  4 25FEB22:09:18:00 25FEB22:09:33:00     3
1-7  4 25FEB22:09:18:00 25FEB22:09:33:00     3
1-7  5 27FEB22:11:33:00 27FEB22:12:39:00     1
1-8  5 27FEB22:11:33:00 27FEB22:12:39:00     1
1-8  6 27FEB22:11:34:00 27FEB22:12:39:00     1
1-9  6 27FEB22:11:34:00 27FEB22:12:39:00     1
;
run;

data want (drop=_: i);
  if _n_=1 then do;
    if 0 then set have;
    call missing(newgroup,_newgrp_beg,_newgrp_end);
    declare hash ngrp (ordered:'a');
      ngrp.definekey('_newgrp_beg');
      ngrp.definedata('_newgrp_beg','_newgrp_end','newgroup');
      ngrp.definedone();
    declare hiter gi ('ngrp');
  end;

  ** Read a group, determine its date range **;
  do i=1 by 1 until (last.group);
    set have;
    by group;
    _current_beg=min(_current_beg,start_time);
    _current_end=max(_current_end,end_time);
  end;


  /** Check the current date ranges against proir newgroup ranges**/
  do _rc=gi.first() by 0 while(_rc=0);
    if _current_beg>_newgrp_end or _current_end<_newgrp_beg then newgroup=.;
    if newgroup^=. then leave;
    _rc=gi.next();
  end; 

  if newgroup=. then do;   /*If no newgroup match found, then this is a new one */
    newgroup=ngrp.num_items+1;
    ngrp.add(key:_current_beg,data:_current_beg,data:_current_end,data:newgroup);
  end;

  /* Now that a newgroup has been assigned, re-read and output this group */
  do until (last.group);
    set have;
    by group;
    output;
  end;
  call missing(of _all_);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Sastech · Posted 04-23-2022 01:55 PM

Hi @mkeintz

QUESTION: To be honest, I simply don't understand how a matching GID is supposed to influence the newgroup assignment.

ANSWER: GID is the main column to form a new group. for example, if the first group is assigned as a newgroup 1 then it will check GIDs in the GROUP 2 against the new group1. if GIDs matches are found and also time overlaps in newgroup 1 and group2, then group 2 will be associated with newgroup 1. likewise it should check all the groups and form new groups.

QUESTION: It matches your expected newgroup assignments.

ANSWER: yes it is correct for the given sample input. But we are facing discrepancies when testing with large data. it's forming newgroup only based on time overlap only and not with GID. having large data like 4M records so that time its falling same timespan. finally it ended with newgroup 1 only.

we have to consider both GID matches and Time over lapping. if both condition are not met then newgroup should be created.

Kurt_Bremser · Posted 04-23-2022 02:53 PM

With enough source data, your conditions will automatically lead to only one group. Just think about it.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 04-24-2022 10:18 AM

@Sastech wrote:

Hi @mkeintz

QUESTION: To be honest, I simply don't understand how a matching GID is supposed to influence the newgroup assignment.

ANSWER: GID is the main column to form a new group. for example, if the first group is assigned as a newgroup 1 then it will check GIDs in the GROUP 2 against the new group1. if GIDs matches are found and also time overlaps in newgroup 1 and group2, then group 2 will be associated with newgroup 1. likewise it should check all the groups and form new groups.

OK, I have a clear understanding that if

a GID from group X matches a GID from a prior group T which was assigned to newgroup=1
and
group X has a time overlap with group

then group X is assigned the same newgroup (i.e. 1) as was group T.

So here are some unanswered questions.

With the above condition, does the date range for newgroup 1 expand? By that I mean, what if group Z overlaps with group X (but not group T) and has some other GID in common with X (but not T)? Does group Z then get assigned to newgroup 1?

You can see @Kurt_Bremser's comment that this could quite likely lead to 1 group.
And what do you do if group X overlaps with group T but has no common GID? What newgroup value is assigned to group X?

I clearly do not have a comprehensive understanding of what the assignment rules are for this problem.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Sastech · Posted 04-25-2022 01:01 PM

HI @mkeintz

Question :

With the above condition, does the date range for newgroup 1 expand?

Answer: Yes, once any group which will be added to the new group based on gid matches and time overlap then newgroup's

time range will be expanded based on the inclusion of the group which is added.

2. By that I mean, what if group Z overlaps with group X (but not group T) and has some other GID in common with X (but not T)? Does group Z then get assigned to newgroup 1?

Answer:

First, it will assign group X (1) as newgroup 1 and it has ids-103-21 and 103-22 and start_time as 05-04-2022 12:26 &end_time as 05-04-2022 13:02

Next it should check any gids which are present in the newgroup with any other group. So here, condition will satisfy with group Z (4) and time window of group Z (4) also overlaps with newgroup.
Now newgroup has ids (103-21,103-22,103-19,103-21) and newgroups time window gets expanded as start_time as start_time as 05-04-2022 11:27 & end_time as 05-04-2022 13:02.

Next, It will continue with other group which has matching id then it will check group T (2) which has 103-22 and time overlap condition satisfies so it will be added to newgroup 1.
Now Newgroup has ids (103-21,103-22,103-19,103-21,102-58,103-22) and newgroups time window gets expanded as start_time as start_time as 05-04-2022 11:27 &end_time as 05-04-2022 13:06.

Next, It will continue with other group which has matching id then it will check group U (3) which has 103-19 and time overlap condition satisfies so it will be added to newgroup 1.
Now Newgroup has ids (103-21,103-22,103-19,103-21,102-58,103-22,103-17,103-19) and newgroups time window gets expanded as start_time as start_time as 05-04-2022 11:26 &end_time as 05-04-2022 13:06.

3.And what do you do if group X overlaps with group T but has no common GID? What newgroup value is assigned to group X?

Answer: No group will be added with new group until both the conditions (ids in group matches with new group and time overlap with new group) are met.

If any of this conditions are not met then the group will be assigned with newgroup.

Thank You

@mkeintz @Kurt_Bremser

Tom · Posted 04-23-2022 02:26 PM

I don't see how a HASH object can help with this problem. The keys to a hash object need to be exact matches. But your problem seems to need to test whether datetime intervals are overlapping or not.

Tom · Posted 04-23-2022 03:26 PM

Can you describe what this data represents? Perhaps understanding the real world case will make it clearer what you are trying to do.

Looking at the data you posted it looks like the start/stop times are for the GROUP not the individual rows.

You have IDs that are mapped to multiple groups.

You have time windows for some groups that overlap.

So is the goal to reduce the number of groups by combining groups where the time overlaps?

Or combining groups where the membership overlaps?
Or both?

Let's eliminate some of the complexity of by mapping the datetime values to simple integers that reflect the relative order. Also let's simplify the ID values so they are just simple letters rather the the hyphenated strings. So if we sort by ID and GROUP we get this input data.

data have;
  input id $ group start end ;
cards;
A 2  3  5
B 2  3  5
B 3  4  8
C 3  4  8
C 4  1  2
D 4  1  2
D 5  6 11
E 5  6 11
E 6  7 11
F 1  9 12
F 6  7 11
G 1  9 12
H 1 10 13
;

If you want to combine based on the existing groups then there is only one group here. Start with group 1 That merges with 6 because ID=F which merges with 5 because of ID=E which merges with 4 because of ID=D which merges with 3 because of ID=C and merges with 2 because ID=B.

If you want to combine based on times then there are two groups. The original group 4 which went between the two smallest time points will be a group by itself since it does not overlap any others. All of the other groups overlap in some way so will become the second new group.

SAS Innovate 2025: Call for Content

Classroom Training Available!