Solved: Re: Count distinct hours in past 3hr

jovic92 · Posted 12-01-2017 07:15 PM

Hi everyone, I have one more problem.

I can't get the count of unique(distinct) values to use in hash table calculation.

data have;
input id  opt  $   t_purchase   : datetime32. t_spent :datetime32.  usage ;
format  t_purchase  t_spent datetime32. ;
cards; 
1    a1  10NOV2017:12:02:00  10NOV2017:14:05:00    3  
1    a1  10NOV2017:12:02:00  10NOV2017:15:20:33    0 
1    a1  10NOV2017:12:02:00  10NOV2017:18:24:50    6  
1    a1  10NOV2017:12:02:00  10NOV2017:14:02:58    3 
1    a1  10NOV2017:12:02:00  10NOV2017:20:22:07   12  
1    a1  10NOV2017:12:02:00  10NOV2017:13:57:12    7  
2    b1  10NOV2017:12:55:24  10NOV2017:13:01:42   14  
2    b1  10NOV2017:12:55:24  10NOV2017:14:42:25    2 
2    b1  10NOV2017:12:55:24  10NOV2017:16:55:32    5 ;
run;

I need to find for every id, opt group (combination) in how many unique hours (column 't_spent') has the record (in past 3 hours). For example, if it's record from 10NOV2017:21:04:33, I need to count a distinct number of hours from 10NOV2017:18:04:33 to 10NOV2017:21:04:33 if id, opt combination has the record. Min what we can get is 1, max is 3.

I have this code (mostly part is useful and helpful and it's written by @Ksharp, but the part I've changed is not working very well) where I tried to count distinct of column 'hourIy_time' additionally created hourly_time = t_spent and formatted format hourly_time datetime11. so I could maybe use it to count distinct but this approach is not the best because it can easily catch 4 different hours. Also, this way I discovered that data step does not support DISTINCT function.

data want;
 if _n_=1 then do;
  if 0 then set have(rename=(t_spent=_t_spent));
  declare hash h(dataset:'have(rename=(t_spent=_t_spent))',hashexp:20);
  h.definekey('id','opt', 't_spent');
  h.definedata('_usage');
  h.definedone();
 end;
set have;
count_hours=0;
do i=t_spent-3*3600 to t_spent ;
 if h.find(key:id,key:opt,key:i)=0 then count_hours+_t_spent;
end;
drop _usage i;
run;

Desired output with the desired column:

id   opt     t_purchase           t_spent        usage  count_hours
1    a1  10NOV2017:12:02:00  10NOV2017:14:05:00    3         1
1    a1  10NOV2017:12:02:00  10NOV2017:15:20:33    0         2
1    a1  10NOV2017:12:02:00  10NOV2017:18:24:50    6         1
1    a1  10NOV2017:12:02:00  10NOV2017:14:02:58    3         2
1    a1  10NOV2017:12:02:00  10NOV2017:20:22:07   12         2
1    a1  10NOV2017:12:02:00  10NOV2017:13:57:12    7         1
2    b1  10NOV2017:12:55:24  10NOV2017:13:01:42   14         1 
2    b1  10NOV2017:12:55:24  10NOV2017:14:42:25    2         2
2    b1  10NOV2017:12:55:24  10NOV2017:16:55:32    5         3
2    b1  10NOV2017:12:55:24  10NOV2017:15:24:07    3         2

Thanks in advance.

ChrisNZ · Posted 12-13-2017 06:10 AM

Something like this (can't test atm) should be close, please run and alter as needed


proc sql;                
  create table WANT as
  select unique det.*
  , sum(sum.T_SPENT*(3600*0<DIFF<=3600*1))>0 as INTERVAL1H
  , sum(sum.T_SPENT*(3600*1<DIFF<=3600*2))>0 as INTERVAL2H
  , sum(sum.T_SPENT*(3600*2<DIFF<=3600*3))>0 as INTERVAL3H
  , sum(calculated INTERVAL1H,calculated INTERVAL2H,calculated INTERVAL3H) as COUNT_THAT_YOU_WANT
   from HAVE det
       ,DIFF dif
       ,HAVE sum
   where det.ID          =sum.ID 
     and det.OPT         =sum.OPT
     and det.T_SPENT-DIFF=sum.T_SPENT
  group by det.ID , det.OPT, det.T_SPENT  
  order by det.ID , det.OPT, det.T_SPENT ; 
quit;

High-Performance SAS Coding - Third Edition

View solution in original post

Kurt_Bremser · Posted 12-02-2017 02:29 AM

Wouldn' that simply be

proc sql;
create table want as
select id, opt, count(distinct(hour(t_spent))) as hour_count
from have
group by id, opt;
quit;

?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 12-02-2017 05:18 AM

OK. Try this one.


data have;
input id  opt  $   t_purchase   : datetime32. t_spent :datetime32.  usage ;
format  t_purchase  t_spent datetime32. ;
cards; 
1    a1  10NOV2017:12:02:00  10NOV2017:14:05:00    3  
1    a1  10NOV2017:12:02:00  10NOV2017:15:20:33    0 
1    a1  10NOV2017:12:02:00  10NOV2017:18:24:50    6  
1    a1  10NOV2017:12:02:00  10NOV2017:14:02:58    3 
1    a1  10NOV2017:12:02:00  10NOV2017:20:22:07   12  
1    a1  10NOV2017:12:02:00  10NOV2017:13:57:12    7  
2    b1  10NOV2017:12:55:24  10NOV2017:13:01:42   14  
2    b1  10NOV2017:12:55:24  10NOV2017:14:42:25    2 
2    b1  10NOV2017:12:55:24  10NOV2017:16:55:32    5 
;
run;

data want;
 if _n_=1 then do;
  if 0 then set have;
  declare hash h(dataset:'have',hashexp:20);
  h.definekey('id','opt', 't_spent');
  h.definedone();

  declare hash h_hour();
  h_hour.definekey('hour');
  h_hour.definedone();
 end;
set have;
do i=t_spent-3*3600 to t_spent ;
 if h.check(key:id,key:opt,key:i)=0 then do;hour=hour(i);h_hour.ref();end;
end;
count_hours=h_hour.num_items;
h_hour.clear();
drop hour i;
run;

jovic92 · Posted 12-03-2017 05:02 AM

Thank you very much. This works just as I needed.

I tasted only on small data, when I tried on 5.5m rows, it's working by now about 6 hours. Is it expected? I thought about splitting into more tables, but that would mean that I need to join them after and that's pretty much the same.

ChrisNZ · Posted 12-03-2017 08:58 PM

This runs in 30 seconds for 10 million records.


data HAVE;
  input ID  OPT  $   T_PURCHASE   : datetime32. T_SPENT :datetime32.  USAGE ;
  format  T_PURCHASE  T_SPENT datetime32. ;
  do ID=1 to 1e6; output; end;
cards; 
1    a1  10NOV2017:12:02:00  10NOV2017:14:05:00    3  
1    a1  10NOV2017:12:02:00  10NOV2017:15:20:33    0 
1    a1  10NOV2017:12:02:00  10NOV2017:18:24:50    6  
1    a1  10NOV2017:12:02:00  10NOV2017:14:02:58    3 
1    a1  10NOV2017:12:02:00  10NOV2017:20:22:07   12  
1    a1  10NOV2017:12:02:00  10NOV2017:13:57:12    7  
2    b1  10NOV2017:12:55:24  10NOV2017:13:01:42   14  
2    b1  10NOV2017:12:55:24  10NOV2017:14:42:25    2 
2    b1  10NOV2017:12:55:24  10NOV2017:16:55:32    5 
run;

data DIFF;
  do DIFF=0 to 3600*3;
    output; 
  end; 
run;

proc sql;                
  create table WANT as
  select unique det.*, count(sum.T_SPENT) as COUNT
   from HAVE det
       ,DIFF dif
       ,HAVE sum
   where det.ID          =sum.ID 
     and det.OPT         =sum.OPT
     and det.T_SPENT-DIFF=sum.T_SPENT
  group by det.ID , det.OPT, det.T_SPENT  
  order by det.ID , det.OPT, det.T_SPENT ; 
quit;

ID	OPT	T_SPENT	COUNT
1	a1	10NOV2017:13:57:12	1
1	a1	10NOV2017:14:02:58	2
1	a1	10NOV2017:14:05:00	3
1	a1	10NOV2017:15:20:33	4
1	a1	10NOV2017:18:24:50	1
1	a1	10NOV2017:20:22:07	2
1	b1	10NOV2017:13:01:42	1

The key to speed is to use equijoins, hence the intermediate table.

High-Performance SAS Coding - Third Edition

Ksharp · Posted 12-04-2017 08:09 AM

You could fast it if sorting the table before .


data have;
input id  opt  $   t_purchase   : datetime32. t_spent :datetime32.  usage ;
format  t_purchase  t_spent datetime32. ;
cards; 
1    a1  10NOV2017:12:02:00  10NOV2017:14:05:00    3  
1    a1  10NOV2017:12:02:00  10NOV2017:15:20:33    0 
1    a1  10NOV2017:12:02:00  10NOV2017:18:24:50    6  
1    a1  10NOV2017:12:02:00  10NOV2017:14:02:58    3 
1    a1  10NOV2017:12:02:00  10NOV2017:20:22:07   12  
1    a1  10NOV2017:12:02:00  10NOV2017:13:57:12    7  
2    b1  10NOV2017:12:55:24  10NOV2017:13:01:42   14  
2    b1  10NOV2017:12:55:24  10NOV2017:14:42:25    2 
2    b1  10NOV2017:12:55:24  10NOV2017:16:55:32    5 
;
run;
proc sort data=have;
 by id  opt;
run;
data want;
 if _n_=1 then do;
  if 0 then set have;
  declare hash h();
  h.definekey('id','opt', 't_spent');
  h.definedone();

  declare hash h_hour();
  h_hour.definekey('hour');
  h_hour.definedone();
 end;

do until(last.opt);
 set have;
 by id opt;
 h.ref();
end;

do until(last.opt);
 set have;
 by id opt;
 do i=t_spent-3*3600 to t_spent ;
   if h.check(key:id,key:opt,key:i)=0 then do;hour=hour(i);h_hour.ref();end;
 end;
 count_hours=h_hour.num_items;
 output;
end;
h.clear();
h_hour.clear();
drop hour i;
run;

ChrisNZ · Posted 12-04-2017 03:23 PM

@Ksharp This is still several orders of magnitude slower than SQL.

High-Performance SAS Coding - Third Edition

Ksharp · Posted 12-05-2017 07:37 AM

@ChrisNZ

Yeah. You are right . It depends on what OP want.

jovic92 · Posted 12-13-2017 04:42 AM

@ChrisNZ @Ksharp

I know this was active days ago, but I just found a little problem.

Both versions give me (mostly) +1 hour that I needed. I know it's a very tricky to solve this because I need to count 3*60 minutes in past from the record and count how many records that combination has in first 60 min, in 120 min, and in 180 min in the past.

For example, If some combination has the record in 13:40 and 14: 05, for the record from 14:05 count need to be 1, not 2. Record 13:40 was in the past hour (13) but is just 25 minutes before, not the 60+.

Thank you both for your help!

ChrisNZ · Posted 12-13-2017 05:45 AM

The count method as given counts 14:05 as the first match and 13:40 as the second match for the record at 14:05, hence you have 2 macthes.

If you don't want to count 14:05, just subtract 1.

High-Performance SAS Coding - Third Edition

jovic92 · Posted 12-13-2017 05:55 AM

I'm not sure if you have understood the question, I'm sorry for not being so clear.

In how many hours (in past 3 hours, 180min/3) that combination has the record. Not how many matches(record) has in each hour.

Maybe someone has just the record in 14:03, 14:05, 14:19, 14:55, 15:02, and that need to be count=1. Because the range is <60min and it does not matter how many records have in that range. If the range is between 0min-60min and 60min-120min it is 2 (example 14:01, 14:03, 14:05, 14:19, 14:55, 15:02) , if it's between 0min-60min, 60min-120min and 120min-180min, that's count=3 (example 14:01, 14:03, 14:05, 14:19, 14:55, 15:02, 16:04), because have records in all the 3 past hours.

Range >180min I don't look at all, need to take observation for each record just the past 3 hours.

ChrisNZ · Posted 12-13-2017 06:10 AM

Something like this (can't test atm) should be close, please run and alter as needed


proc sql;                
  create table WANT as
  select unique det.*
  , sum(sum.T_SPENT*(3600*0<DIFF<=3600*1))>0 as INTERVAL1H
  , sum(sum.T_SPENT*(3600*1<DIFF<=3600*2))>0 as INTERVAL2H
  , sum(sum.T_SPENT*(3600*2<DIFF<=3600*3))>0 as INTERVAL3H
  , sum(calculated INTERVAL1H,calculated INTERVAL2H,calculated INTERVAL3H) as COUNT_THAT_YOU_WANT
   from HAVE det
       ,DIFF dif
       ,HAVE sum
   where det.ID          =sum.ID 
     and det.OPT         =sum.OPT
     and det.T_SPENT-DIFF=sum.T_SPENT
  group by det.ID , det.OPT, det.T_SPENT  
  order by det.ID , det.OPT, det.T_SPENT ; 
quit;

High-Performance SAS Coding - Third Edition

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away