Create Count Based on Dates & Custom Date Logic

Time_Looper47 · Posted 12-18-2019 01:21 PM

Greetings,

So what I'm looking to do is create a count as a number of events for each key_id based on unique dates (with a caveat). My data looks like this:

key_id | event_dt | event_ts

1142 | 10JUL2019 | 10JUL2019:18:45:12

1142 | 10JUL2019 | 10JUL2019:18:00:39

1142 | 09JUL2019 | 09JUL2019:16:32:51

1142 | 15JUL2019 | 15JUL2019:12:30:34

3462 | 01OCT2019 | 01OCT2019:14:29:43

3462 | 01OCT2019 | 01OCT2019:14:00:12

3462 | 01OCT2019 | 01OCT2019:14:45:01

3462 | 12NOV2019 | 12NOV2019:11:17:21

4444 | 04AUG2019 | 04AUG2019:11:20:35

4444 | 02MAR2019 | 02MAR2019:10:55:13

4444 | 14DEC2019 | 14DEC2019:14:32:17

4444 | 21OCT2019 | 21OCT2019:14:12:42

event_dt format is: DATE9.

event_ts format is: DATETIME19.

The caveat is that, I want to create a count (by key_id) of events with unique dates, however, if an event falls on the same day and is within 60 mins of the other, I don't want to count that as a second event. So, for the above data, I would desire something like:

key_id | count_event

1142 | 3

3462 | 2

4444 | 4

Thank you

mkeintz · Posted 12-18-2019 01:28 PM

Questions: Does your 60- minute filter cross date boundaries, i.e. is 11JUL2019:00:15:00 within the same event as 10JUL2019:23:45:00?

If yes, then

data have;
  infile datalines dlm='|';
  input key_id  event_dt :date9.   event_ts  datetime18. ;
  format event_dt date9.  event_ts datetime18.;
datalines;
1142 | 10JUL2019 | 10JUL2019:18:45:12
1142 | 10JUL2019 | 10JUL2019:18:00:39
1142 | 09JUL2019 | 09JUL2019:16:32:51
1142 | 15JUL2019 | 15JUL2019:12:30:34
3462 | 01OCT2019 | 01OCT2019:14:29:43
3462 | 01OCT2019 | 01OCT2019:14:00:12
3462 | 01OCT2019 | 01OCT2019:14:45:01
3462 | 12NOV2019 | 12NOV2019:11:17:21
4444 | 04AUG2019 | 04AUG2019:11:20:35
4444 | 02MAR2019 | 02MAR2019:10:55:13
4444 | 14DEC2019 | 14DEC2019:14:32:17
4444 | 21OCT2019 | 21OCT2019:14:12:42
run;

proc sort data=have;
   by key_id event_ts;
run;

data want (keep=key_id n_events;
  set have;
  by key_id;

  if event_ts > intnx('minute',lag(event_ts),60) then n_events+1;
  if first.key_id then n_events=1;
  if last.key_id;
run;

But if your time stamps were to the second, rather than to the whole minute, you could have an error for the time span from, say, 13:00:32 to 14:00:58 which is more the 60 minutes (by an exce3ss of 36 seconds), but would not be detected because the INTNX function is using a granularity of 'minute'. You could address this via:

if event_ts > intnx('second',lag(event_ts),3600) then n_events+1;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Time_Looper47 · Posted 12-18-2019 03:40 PM

Hey @mkeintz , good question - it does NOT actually cross boundaries. If it's a new day (but within 60 minutes from a previous day) it's a new event no matter what

Time_Looper47 · Posted 12-18-2019 04:15 PM

@mkeintz @novinosrin

How would I modify this to address the fact that no matter what if it's a new day, it's a new event, even if it's within 60 mins across the boundary?

novinosrin · Posted 12-18-2019 03:31 PM

HI @Time_Looper47 Go with genie mark's solution. I am bored, so my share of fun


data have;
input key_id  @8 event_dt date9. @20 event_ts datetime20.;
format event_dt date9. event_ts datetime20.;
cards;
1142 | 10JUL2019 | 10JUL2019:18:45:12
1142 | 10JUL2019 | 10JUL2019:18:00:39
1142 | 09JUL2019 | 09JUL2019:16:32:51
1142 | 15JUL2019 | 15JUL2019:12:30:34
3462 | 01OCT2019 | 01OCT2019:14:29:43
3462 | 01OCT2019 | 01OCT2019:14:00:12
3462 | 01OCT2019 | 01OCT2019:14:45:01
3462 | 12NOV2019 | 12NOV2019:11:17:21
4444 | 04AUG2019 | 04AUG2019:11:20:35
4444 | 02MAR2019 | 02MAR2019:10:55:13
4444 | 14DEC2019 | 14DEC2019:14:32:17
4444 | 21OCT2019 | 21OCT2019:14:12:42
;

data want;
 do until(last.key_id);
  do until(last.event_dt);
   set have;
   by key_id event_dt notsorted;
   if not first.event_dt and abs(dif(minute(event_ts)))<=60 then continue;
   count_event=sum(count_event,1);
  end;
 end;
run;

PGStats · Posted 12-18-2019 06:48 PM

What if you have dates

01OCT2019:14:00:12
01OCT2019:14:15:12
01OCT2019:15:10:12

The first two are within 60 mins, so are the last two, but not the first and last. Does that count as one or two distinct events? What's the rule for such cases?

PG

Time_Looper47 · Posted 12-19-2019 01:38 PM

@PGStats Good question - so the way these cases would be handled is, you go 60 mins from the FIRST event. So in your example it would count as 2 events (the first two get grouped into one, and since the last one is greater than 60 mins from the first, it would be a separate event.

Also note that: if an event falls on a separate day but is within 60 mins of the first event (lets say the events were 01OCT2019:11:30:12 and 02OCT2019:00:10:10) it counts as two separate events.

novinosrin · Posted 12-19-2019 02:25 PM

Hi @Time_Looper47 Can you try this?

proc sort data=have out=_have;
by key_id event_dt event_ts;
run;
data want;
 set _have;
 by key_id event_dt ;
 retain t;
 if first.key_id  then do; count=1;t=event_ts;end;
 else if first.event_dt then do; count+1;t=event_ts;end;
 else if intck('minute',t,event_ts)>60 then do;
  t=event_ts;
  count+1;
 end;
 if last.key_id;
 keep key_id count;
run;

PGStats · Posted 12-19-2019 03:06 PM

This would work:

data have;
  infile datalines dlm='|';
  input key_id  event_dt :date9.   event_ts  datetime18. ;
  format event_dt date9.  event_ts datetime18.;
datalines;
1142 | 10JUL2019 | 10JUL2019:18:45:12
1142 | 10JUL2019 | 10JUL2019:18:00:39
1142 | 09JUL2019 | 09JUL2019:16:32:51
1142 | 15JUL2019 | 15JUL2019:12:30:34
3462 | 01OCT2019 | 01OCT2019:14:29:43
3462 | 01OCT2019 | 01OCT2019:14:00:12
3462 | 01OCT2019 | 01OCT2019:14:45:01
3462 | 12NOV2019 | 12NOV2019:11:17:21
4444 | 04AUG2019 | 04AUG2019:11:20:35
4444 | 02MAR2019 | 02MAR2019:10:55:13
4444 | 14DEC2019 | 14DEC2019:14:32:17
4444 | 21OCT2019 | 21OCT2019:14:12:42
4444 | 21OCT2019 | 21OCT2019:23:12:42  
4444 | 22OCT2019 | 22OCT2019:00:02:42  New day
4444 | 22OCT2019 | 22OCT2019:00:52:42  Same event
4444 | 22OCT2019 | 22OCT2019:01:12:42  New event
run;

proc sort data=have;
   by key_id event_ts;
run;

data temp (keep=key_id event);
retain first_event_ts;
set have; by key_id;
if 	first.key_id or 
	event_dt > lag(event_dt) or 
	intnx("minute", event_ts, -60) > first_event_ts then do;
		event + 1;
		first_event_ts = event_ts;
		end;	
run;

proc sql;
create table want as
select 
	key_id ,
	count (distinct event) as count_event
from temp
group by key_id;
select * from want;
quit;

Note: I added data to test your latest requirements.

PG

Create Count Based on Dates & Custom Date Logic

Re: Create Count Based on Dates & Custom Date Logic

Re: Create Count Based on Dates & Custom Date Logic

Re: Create Count Based on Dates & Custom Date Logic

Re: Create Count Based on Dates & Custom Date Logic

Re: Create Count Based on Dates & Custom Date Logic

Re: Create Count Based on Dates & Custom Date Logic

Re: Create Count Based on Dates & Custom Date Logic

Re: Create Count Based on Dates & Custom Date Logic

Classroom Training Available!