🔒 This topic is solved and locked.
Sastech
Fluorite | Level 6

If any gid of group 1 matches a gid of group 2, a new group_id of 1 should be generated, and this continues until the match chain breaks from one group to the next. In addition, the start time of each newly compared group must fall within the start and end times of the previous group.

Groups 1 and 2 will then fall under the same new_group 1, and the minimum start time and maximum end time across both groups will be used as the start and end times for the next group comparison.

 

For example, if group 10 does not match the previous group's gids and its time window also does not fall within any other new_group window, then group 10 becomes a separate group with new group_id 2.
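To make the intended rule concrete, here is a sketch of the logic in Python, purely as an illustration (the function and field names are our own, not part of any SAS code):

```python
def assign_new_groups(groups):
    """groups: list of dicts with 'gids' (a set), 'start' and 'end' times.
    Times can be datetimes or plain numbers; only comparisons are used."""
    new_id = 0
    windows = []  # one entry per new_group: its id, merged gids, merged window
    for g in groups:
        placed = None
        for w in windows:
            # same new_group only if a gid matches AND the start time
            # falls inside the existing new_group window
            if (g["gids"] & w["gids"]) and w["start"] <= g["start"] <= w["end"]:
                placed = w
                break
        if placed is None:  # no match: open the next new_group
            new_id += 1
            placed = {"id": new_id, "gids": set(),
                      "start": g["start"], "end": g["end"]}
            windows.append(placed)
        # widen the window to min start / max end and collect the gids
        placed["gids"] |= g["gids"]
        placed["start"] = min(placed["start"], g["start"])
        placed["end"] = max(placed["end"], g["end"])
        g["new_group"] = placed["id"]
    return groups
```

For instance, two groups sharing gid 85-52 with overlapping windows land in new_group 1, while a group with unrelated gids opens new_group 2.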

 

We have provided a screenshot of the input and the expected output for the grouping.

 

Note: start_time and end_time are read as character in the query below, but in our actual data they are numeric with the DATETIME20. format.

 

data have;
input group gids $ start_time :$16. end_time :$16.;
datalines;
1 85-51 16FEB22:00:03:00 16FEB22:00:55:00
1 85-52 16FEB22:00:03:00 16FEB22:00:55:00
2 85-52 16FEB22:00:07:00 16FEB22:01:38:00
2 85-53 16FEB22:00:07:00 16FEB22:01:38:00
3 85-53 16FEB22:00:16:00 16FEB22:01:40:00
3 85-54 16FEB22:00:16:00 16FEB22:01:40:00
4 85-54 16FEB22:01:18:00 16FEB22:01:58:00
4 85-55 16FEB22:01:18:00 16FEB22:01:58:00
5 85-55 16FEB22:01:20:00 16FEB22:02:51:00
5 85-56 16FEB22:01:20:00 16FEB22:02:51:00
6 85-56 16FEB22:01:21:00 16FEB22:03:50:00
6 85-57 16FEB22:01:21:00 16FEB22:03:50:00
7 85-57 16FEB22:02:35:00 16FEB22:04:17:00
7 85-58 16FEB22:02:35:00 16FEB22:04:17:00
8 85-58 16FEB22:02:47:00 16FEB22:05:39:00
8 85-60 16FEB22:02:47:00 16FEB22:05:39:00
9 85-60 16FEB22:03:59:00 16FEB22:05:39:00
9 89-51 16FEB22:03:59:00 16FEB22:05:39:00
10 85-66 16FEB22:10:11:00 16FEB22:10:52:00
10 85-67 16FEB22:10:11:00 16FEB22:10:52:00
11 89-51 16FEB22:05:01:00 16FEB22:06:10:00
11 89-52 16FEB22:05:01:00 16FEB22:06:10:00
12 89-52 16FEB22:05:02:00 16FEB22:06:10:00
12 89-53 16FEB22:05:02:00 16FEB22:06:10:00
13 89-72 16FEB22:13:57:00 16FEB22:15:19:00
13 89-73 16FEB22:13:57:00 16FEB22:15:19:00
13 89-75 16FEB22:13:57:00 16FEB22:15:19:00
14 89-73 16FEB22:14:00:00 16FEB22:15:19:00
14 89-75 16FEB22:14:00:00 16FEB22:15:19:00
14 89-76 16FEB22:14:00:00 16FEB22:15:19:00
15 89-80 16FEB22:21:11:00 16FEB22:23:05:00
15 89-81 16FEB22:21:11:00 16FEB22:23:05:00
16 89-81 16FEB22:21:13:00 17FEB22:00:31:00
16 89-82 16FEB22:21:13:00 17FEB22:00:31:00
;
run;

 

[Screenshot: input and expected output for grouping]

@Kurt_Bremser 

 

Thank You in advance 

7 REPLIES 7
Kurt_Bremser
Super User

I made some changes to your data step so that we get real datetime values; this is mostly irrelevant for the solution, but it shows how such values are read from datalines.

data have;
input group gids $ (start_time end_time) (:datetime19.);
format start_time end_time e8601dt19.;
datalines;
1 85-51 16FEB22:00:03:00 16FEB22:00:55:00
1 85-52 16FEB22:00:03:00 16FEB22:00:55:00
2 85-52 16FEB22:00:07:00 16FEB22:01:38:00
2 85-53 16FEB22:00:07:00 16FEB22:01:38:00
3 85-53 16FEB22:00:16:00 16FEB22:01:40:00
3 85-54 16FEB22:00:16:00 16FEB22:01:40:00
4 85-54 16FEB22:01:18:00 16FEB22:01:58:00
4 85-55 16FEB22:01:18:00 16FEB22:01:58:00
5 85-55 16FEB22:01:20:00 16FEB22:02:51:00
5 85-56 16FEB22:01:20:00 16FEB22:02:51:00
6 85-56 16FEB22:01:21:00 16FEB22:03:50:00
6 85-57 16FEB22:01:21:00 16FEB22:03:50:00
7 85-57 16FEB22:02:35:00 16FEB22:04:17:00
7 85-58 16FEB22:02:35:00 16FEB22:04:17:00
8 85-58 16FEB22:02:47:00 16FEB22:05:39:00
8 85-60 16FEB22:02:47:00 16FEB22:05:39:00
9 85-60 16FEB22:03:59:00 16FEB22:05:39:00
9 89-51 16FEB22:03:59:00 16FEB22:05:39:00
10 85-66 16FEB22:10:11:00 16FEB22:10:52:00
10 85-67 16FEB22:10:11:00 16FEB22:10:52:00
11 89-51 16FEB22:05:01:00 16FEB22:06:10:00
11 89-52 16FEB22:05:01:00 16FEB22:06:10:00
12 89-52 16FEB22:05:02:00 16FEB22:06:10:00
12 89-53 16FEB22:05:02:00 16FEB22:06:10:00
13 89-72 16FEB22:13:57:00 16FEB22:15:19:00
13 89-73 16FEB22:13:57:00 16FEB22:15:19:00
13 89-75 16FEB22:13:57:00 16FEB22:15:19:00
14 89-73 16FEB22:14:00:00 16FEB22:15:19:00
14 89-75 16FEB22:14:00:00 16FEB22:15:19:00
14 89-76 16FEB22:14:00:00 16FEB22:15:19:00
15 89-80 16FEB22:21:11:00 16FEB22:23:05:00
15 89-81 16FEB22:21:11:00 16FEB22:23:05:00
16 89-81 16FEB22:21:13:00 17FEB22:00:31:00
16 89-82 16FEB22:21:13:00 17FEB22:00:31:00
;

From that, use a hash object to keep track of new groups:

data want;
set have;
by group;
retain newgroup max_newgroup;
if _n_ = 1
then do; /* build a dynamic object which keeps track of gids to newgroup relationship */
  declare hash new ();
  new.definekey("gids");
  new.definedata("newgroup");
  new.definedone();
  max_newgroup = 0;
end;
if first.group and new.find() ne 0
/* at the start of a group, we can't find a match, so we create a new group */
then do;
  max_newgroup + 1;
  newgroup = max_newgroup;
  rc = new.add();
end;
if new.find() ne 0 /* current gids is not found in the hash */
/* since newgroup is either taken from the hash or newly set
  at the start of a group, and retained, we can now add it */
then rc = new.add();
drop max_newgroup rc;
run;
Sastech
Fluorite | Level 6

Hi Kurt,

 

Thank you for the quick response. The code you sent is working fine with the test data we have;

we will now check it against the actual data, which contains 4M rows across different days.

Sastech
Fluorite | Level 6

Problem

 

data have;
input group gids $ (start_time end_time) (:datetime19.);
format start_time end_time e8601dt19.;
datalines;
1 1-10 16FEB22:08:32:00 16FEB22:08:58:00
1 1-11 16FEB22:08:32:00 16FEB22:08:58:00
1 1-8 16FEB22:07:48:00 16FEB22:08:36:00
2 1-11 16FEB22:08:33:00 16FEB22:09:16:00
2 1-12 16FEB22:08:33:00 16FEB22:09:16:00
6 1-8 16FEB22:07:28:00 16FEB22:08:35:00
6 2-57 16FEB22:07:28:00 16FEB22:08:35:00
146 2-56 16FEB22:07:05:00 16FEB22:07:54:00
146 2-57 16FEB22:07:05:00 16FEB22:07:54:00
;
run;
As per the code below, it searches only the first gid of each group, but it needs to check all gids of a group for a match. If any gid in the group matches, the same newgroup should be assigned.

 

if first.group and new.find() ne 0
/* at the start of a group, we can't find a match, so we create a new group */
then do;
max_newgroup + 1;
newgroup = max_newgroup;
rc = new.add();

 

As a result of this, when we analyzed the code with another set of data, gid 2-56 alone went into a new group, even though all of these rows belong to the same group. Could you please let us know how to resolve this?

[Screenshot: output showing gid 2-56 assigned to a separate newgroup]

 

@Kurt_Bremser 

Kurt_Bremser
Super User

So we first need to check each group to see whether any of its gids already has an assigned newgroup, and then process the group a second time to write newgroup:

data want;
if 0 then set have; /* defines variables into the PDV, does nothing else */
retain newgroup max_newgroup;
if _n_ = 1
then do; /* build a dynamic object which keeps track of gids to newgroup relationship */
  declare hash new ();
  new.definekey("gids");
  new.definedata("newgroup");
  new.definedone();
  max_newgroup = 0;
end;
ng = 0;
do until (last.group);
  set have;
  by group;
  if new.find() = 0 then ng = newgroup;
end;
if not ng
then do;
  max_newgroup + 1;
  ng = max_newgroup;
end;
newgroup = ng;
do until (last.group);
  set have;
  by group;
  if new.check() ne 0 /* current gids is not found in the hash */
  then rc = new.add();
  output;
end;
drop max_newgroup rc ng;
run;
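The two-pass logic of this DATA step can be mirrored in Python (a sketch only, with a plain dict standing in for the hash object; the names are mine, not from the SAS code):

```python
from itertools import groupby

def assign_new_groups(rows):
    """rows: list of (group, gid) pairs, already sorted by group
    (the equivalent of BY group)."""
    gid_to_new = {}   # plays the role of the hash object
    max_new = 0
    out = []
    for group, members in groupby(rows, key=lambda r: r[0]):
        members = list(members)
        # pass 1: does ANY gid of this group already have a newgroup?
        ng = next((gid_to_new[g] for _, g in members if g in gid_to_new), None)
        if ng is None:          # none found: open a new group
            max_new += 1
            ng = max_new
        # pass 2: register every gid and emit the rows
        for grp, gid in members:
            gid_to_new.setdefault(gid, ng)
            out.append((grp, gid, ng))
    return out
```

With the problem data above, groups 1, 2, 6 and 146 all chain together through gids 1-11, 1-8 and 2-57, so every row receives newgroup 1, including gid 2-56.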
Sastech
Fluorite | Level 6

Hi Kurt

 

 

data have;
input gids $ group (start_time end_time) (:datetime19.);
format start_time end_time e8601dt19.;
datalines;
1-10 1 27FEB22:12:32:00 27FEB22:13:04:00
1-9 1 27FEB22:12:32:00 27FEB22:13:04:00
2-15 1 27FEB22:12:33:00 27FEB22:13:12:00
1-4 2 27FEB22:10:00:00 27FEB22:10:51:00
1-5 2 27FEB22:10:00:00 27FEB22:10:51:00
1-5 3 27FEB22:10:27:00 27FEB22:11:39:00
1-6 3 27FEB22:10:27:00 27FEB22:11:39:00
1-6 4 25FEB22:09:18:00 25FEB22:09:33:00
1-7 4 25FEB22:09:18:00 25FEB22:09:33:00
1-7 5 27FEB22:11:33:00 27FEB22:12:39:00
1-8 5 27FEB22:11:33:00 27FEB22:12:39:00
1-8 6 27FEB22:11:34:00 27FEB22:12:39:00
1-9 6 27FEB22:11:34:00 27FEB22:12:39:00
;
run;

 

We have data for multiple days; previously we had tried this with only a single day of data.

While grouping across multiple days, an issue arises when a group falls on multiple days.

 

As per the screenshot below:

[Screenshot: groups spanning multiple days]

Groups 3 and 4 should not be matched: even though their gids match, they fall on different dates and there is no continuity in the time frame.

 

Note: we face this issue only when working with multiple dates. If one group runs from 25 Feb 11:55 PM to 26 Feb 00:35 AM, another group starts at 00:10 AM on 26 Feb, and their gids match, then they can be a single group.

 

@Kurt_Bremser 

Ksharp
Super User
data have;
input group gids $ start_time : datetime20. end_time : datetime20. ;
format start_time end_time datetime20. ;
datalines;
1 85-51 16FEB22:00:03:00 16FEB22:00:55:00
1 85-52 16FEB22:00:03:00 16FEB22:00:55:00
2 85-52 16FEB22:00:07:00 16FEB22:01:38:00
2 85-53 16FEB22:00:07:00 16FEB22:01:38:00
3 85-53 16FEB22:00:16:00 16FEB22:01:40:00
3 85-54 16FEB22:00:16:00 16FEB22:01:40:00
4 85-54 16FEB22:01:18:00 16FEB22:01:58:00
4 85-55 16FEB22:01:18:00 16FEB22:01:58:00
5 85-55 16FEB22:01:20:00 16FEB22:02:51:00
5 85-56 16FEB22:01:20:00 16FEB22:02:51:00
6 85-56 16FEB22:01:21:00 16FEB22:03:50:00
6 85-57 16FEB22:01:21:00 16FEB22:03:50:00
7 85-57 16FEB22:02:35:00 16FEB22:04:17:00
7 85-58 16FEB22:02:35:00 16FEB22:04:17:00
8 85-58 16FEB22:02:47:00 16FEB22:05:39:00
8 85-60 16FEB22:02:47:00 16FEB22:05:39:00
9 85-60 16FEB22:03:59:00 16FEB22:05:39:00
9 89-51 16FEB22:03:59:00 16FEB22:05:39:00
10 85-66 16FEB22:10:11:00 16FEB22:10:52:00
10 85-67 16FEB22:10:11:00 16FEB22:10:52:00
11 89-51 16FEB22:05:01:00 16FEB22:06:10:00
11 89-52 16FEB22:05:01:00 16FEB22:06:10:00
12 89-52 16FEB22:05:02:00 16FEB22:06:10:00
12 89-53 16FEB22:05:02:00 16FEB22:06:10:00
13 89-72 16FEB22:13:57:00 16FEB22:15:19:00
13 89-73 16FEB22:13:57:00 16FEB22:15:19:00
13 89-75 16FEB22:13:57:00 16FEB22:15:19:00
14 89-73 16FEB22:14:00:00 16FEB22:15:19:00
14 89-75 16FEB22:14:00:00 16FEB22:15:19:00
14 89-76 16FEB22:14:00:00 16FEB22:15:19:00
15 89-80 16FEB22:21:11:00 16FEB22:23:05:00
15 89-81 16FEB22:21:11:00 16FEB22:23:05:00
16 89-81 16FEB22:21:13:00 17FEB22:00:31:00
16 89-82 16FEB22:21:13:00 17FEB22:00:31:00
;
run;

data step1;
 /* explode every interval into one observation per covered second */
 set have(keep=start_time end_time);
 do dt=start_time to end_time; /* datetime values are counted in seconds */
  output;
 end;
 keep dt;
run;
/* keep one row per distinct second */
proc sort data=step1 out=step2 nodupkey;by dt;run;
data step3;
 /* start a new group wherever the timeline has a gap of more than one second */
 set step2;
 if dif(dt) ne 1 then group+1;
 format dt datetime.;
run;
data want;
 if _n_=1 then do; /* load the second-to-group mapping into a hash */
  declare hash h(dataset:'step3');
  h.definekey('dt');
  h.definedata('group');
  h.definedone();
 end;
set have;
 do dt=start_time to end_time;
  call missing(group);
  if h.find()=0 then leave; /* take the group of the first covered second */
 end;
drop dt;
run;
proc sort data=want;by group;run;
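The idea behind this approach, exploding each interval to its covered seconds and then breaking the combined timeline wherever there is a gap, can be sketched in Python (illustration only, using plain integers for datetime seconds; the names are my own):

```python
def group_by_time_continuity(intervals):
    """intervals: list of (start, end) in integer seconds.
    Returns one timeline-group number per interval, mirroring
    step1 (explode), step2 (dedupe/sort) and step3 (break at gaps)."""
    ticks = sorted({t for s, e in intervals for t in range(s, e + 1)})
    tick_group = {}
    grp = 0
    prev = None
    for t in ticks:
        if prev is None or t - prev != 1:   # gap of more than one second
            grp += 1
        tick_group[t] = grp
        prev = t
    # like the final hash lookup: the group of the first covered second
    return [tick_group[s] for s, _ in intervals]
```

Note that exploding every interval to one row per second can become expensive over multiple days of data with millions of rows.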
Sastech
Fluorite | Level 6

Hi Ksharp,

 

Thank you for the quick response. The code you sent is working fine with the test data we have;

we will now check it against the actual data, which contains 4M rows across different days.


Discussion stats
  • 7 replies
  • 1268 views
  • 2 likes
  • 3 in conversation