Hi everyone, I have one more problem.
I can't get the count of unique(distinct) values to use in hash table calculation.
data have;
input id opt $ t_purchase : datetime32. t_spent :datetime32. usage ;
format t_purchase t_spent datetime32. ;
cards;
1 a1 10NOV2017:12:02:00 10NOV2017:14:05:00 3
1 a1 10NOV2017:12:02:00 10NOV2017:15:20:33 0
1 a1 10NOV2017:12:02:00 10NOV2017:18:24:50 6
1 a1 10NOV2017:12:02:00 10NOV2017:14:02:58 3
1 a1 10NOV2017:12:02:00 10NOV2017:20:22:07 12
1 a1 10NOV2017:12:02:00 10NOV2017:13:57:12 7
2 b1 10NOV2017:12:55:24 10NOV2017:13:01:42 14
2 b1 10NOV2017:12:55:24 10NOV2017:14:42:25 2
2 b1 10NOV2017:12:55:24 10NOV2017:16:55:32 5 ;
run;
I need to find for every id, opt group (combination) in how many unique hours (column 't_spent') has the record (in past 3 hours). For example, if it's record from 10NOV2017:21:04:33, I need to count a distinct number of hours from 10NOV2017:18:04:33 to 10NOV2017:21:04:33 if id, opt combination has the record. Min what we can get is 1, max is 3.
I have this code (mostly part is useful and helpful and it's written by @Ksharp, but the part I've changed is not working very well) where I tried to count distinct of column 'hourIy_time' additionally created hourly_time = t_spent and formatted format hourly_time datetime11. so I could maybe use it to count distinct but this approach is not the best because it can easily catch 4 different hours. Also, this way I discovered that data step does not support DISTINCT function.
data want;
if _n_=1 then do;
if 0 then set have(rename=(t_spent=_t_spent));
declare hash h(dataset:'have(rename=(t_spent=_t_spent))',hashexp:20);
h.definekey('id','opt', 't_spent');
h.definedata('_usage');
h.definedone();
end;
set have;
count_hours=0;
do i=t_spent-3*3600 to t_spent ;
if h.find(key:id,key:opt,key:i)=0 then count_hours+_t_spent;
end;
drop _usage i;
run;
Desired output with the desired column:
id opt t_purchase t_spent usage count_hours
1 a1 10NOV2017:12:02:00 10NOV2017:14:05:00 3 1
1 a1 10NOV2017:12:02:00 10NOV2017:15:20:33 0 2
1 a1 10NOV2017:12:02:00 10NOV2017:18:24:50 6 1
1 a1 10NOV2017:12:02:00 10NOV2017:14:02:58 3 2
1 a1 10NOV2017:12:02:00 10NOV2017:20:22:07 12 2
1 a1 10NOV2017:12:02:00 10NOV2017:13:57:12 7 1
2 b1 10NOV2017:12:55:24 10NOV2017:13:01:42 14 1
2 b1 10NOV2017:12:55:24 10NOV2017:14:42:25 2 2
2 b1 10NOV2017:12:55:24 10NOV2017:16:55:32 5 3
2 b1 10NOV2017:12:55:24 10NOV2017:15:24:07 3 2
Thanks in advance.
Something like this (can't test atm) should be close, please run and alter as needed
proc sql;
create table WANT as
select unique det.*
, sum(sum.T_SPENT*(3600*0<DIFF<=3600*1))>0 as INTERVAL1H
, sum(sum.T_SPENT*(3600*1<DIFF<=3600*2))>0 as INTERVAL2H
, sum(sum.T_SPENT*(3600*2<DIFF<=3600*3))>0 as INTERVAL3H
, sum(calculated INTERVAL1H,calculated INTERVAL2H,calculated INTERVAL3H) as COUNT_THAT_YOU_WANT
from HAVE det
,DIFF dif
,HAVE sum
where det.ID =sum.ID
and det.OPT =sum.OPT
and det.T_SPENT-DIFF=sum.T_SPENT
group by det.ID , det.OPT, det.T_SPENT
order by det.ID , det.OPT, det.T_SPENT ;
quit;
Wouldn' that simply be
proc sql;
create table want as
select id, opt, count(distinct(hour(t_spent))) as hour_count
from have
group by id, opt;
quit;
?
OK. Try this one.
data have;
input id opt $ t_purchase : datetime32. t_spent :datetime32. usage ;
format t_purchase t_spent datetime32. ;
cards;
1 a1 10NOV2017:12:02:00 10NOV2017:14:05:00 3
1 a1 10NOV2017:12:02:00 10NOV2017:15:20:33 0
1 a1 10NOV2017:12:02:00 10NOV2017:18:24:50 6
1 a1 10NOV2017:12:02:00 10NOV2017:14:02:58 3
1 a1 10NOV2017:12:02:00 10NOV2017:20:22:07 12
1 a1 10NOV2017:12:02:00 10NOV2017:13:57:12 7
2 b1 10NOV2017:12:55:24 10NOV2017:13:01:42 14
2 b1 10NOV2017:12:55:24 10NOV2017:14:42:25 2
2 b1 10NOV2017:12:55:24 10NOV2017:16:55:32 5
;
run;
data want;
if _n_=1 then do;
if 0 then set have;
declare hash h(dataset:'have',hashexp:20);
h.definekey('id','opt', 't_spent');
h.definedone();
declare hash h_hour();
h_hour.definekey('hour');
h_hour.definedone();
end;
set have;
do i=t_spent-3*3600 to t_spent ;
if h.check(key:id,key:opt,key:i)=0 then do;hour=hour(i);h_hour.ref();end;
end;
count_hours=h_hour.num_items;
h_hour.clear();
drop hour i;
run;
Thank you very much. This works just as I needed.
I tasted only on small data, when I tried on 5.5m rows, it's working by now about 6 hours. Is it expected? I thought about splitting into more tables, but that would mean that I need to join them after and that's pretty much the same.
This runs in 30 seconds for 10 million records.
data HAVE;
input ID OPT $ T_PURCHASE : datetime32. T_SPENT :datetime32. USAGE ;
format T_PURCHASE T_SPENT datetime32. ;
do ID=1 to 1e6; output; end;
cards;
1 a1 10NOV2017:12:02:00 10NOV2017:14:05:00 3
1 a1 10NOV2017:12:02:00 10NOV2017:15:20:33 0
1 a1 10NOV2017:12:02:00 10NOV2017:18:24:50 6
1 a1 10NOV2017:12:02:00 10NOV2017:14:02:58 3
1 a1 10NOV2017:12:02:00 10NOV2017:20:22:07 12
1 a1 10NOV2017:12:02:00 10NOV2017:13:57:12 7
2 b1 10NOV2017:12:55:24 10NOV2017:13:01:42 14
2 b1 10NOV2017:12:55:24 10NOV2017:14:42:25 2
2 b1 10NOV2017:12:55:24 10NOV2017:16:55:32 5
run;
data DIFF;
do DIFF=0 to 3600*3;
output;
end;
run;
proc sql;
create table WANT as
select unique det.*, count(sum.T_SPENT) as COUNT
from HAVE det
,DIFF dif
,HAVE sum
where det.ID =sum.ID
and det.OPT =sum.OPT
and det.T_SPENT-DIFF=sum.T_SPENT
group by det.ID , det.OPT, det.T_SPENT
order by det.ID , det.OPT, det.T_SPENT ;
quit;
ID | OPT | T_SPENT | COUNT |
1 | a1 | 10NOV2017:13:57:12 | 1 |
1 | a1 | 10NOV2017:14:02:58 | 2 |
1 | a1 | 10NOV2017:14:05:00 | 3 |
1 | a1 | 10NOV2017:15:20:33 | 4 |
1 | a1 | 10NOV2017:18:24:50 | 1 |
1 | a1 | 10NOV2017:20:22:07 | 2 |
1 | b1 | 10NOV2017:13:01:42 | 1 |
The key to speed is to use equijoins, hence the intermediate table.
You could fast it if sorting the table before .
data have;
input id opt $ t_purchase : datetime32. t_spent :datetime32. usage ;
format t_purchase t_spent datetime32. ;
cards;
1 a1 10NOV2017:12:02:00 10NOV2017:14:05:00 3
1 a1 10NOV2017:12:02:00 10NOV2017:15:20:33 0
1 a1 10NOV2017:12:02:00 10NOV2017:18:24:50 6
1 a1 10NOV2017:12:02:00 10NOV2017:14:02:58 3
1 a1 10NOV2017:12:02:00 10NOV2017:20:22:07 12
1 a1 10NOV2017:12:02:00 10NOV2017:13:57:12 7
2 b1 10NOV2017:12:55:24 10NOV2017:13:01:42 14
2 b1 10NOV2017:12:55:24 10NOV2017:14:42:25 2
2 b1 10NOV2017:12:55:24 10NOV2017:16:55:32 5
;
run;
proc sort data=have;
by id opt;
run;
data want;
if _n_=1 then do;
if 0 then set have;
declare hash h();
h.definekey('id','opt', 't_spent');
h.definedone();
declare hash h_hour();
h_hour.definekey('hour');
h_hour.definedone();
end;
do until(last.opt);
set have;
by id opt;
h.ref();
end;
do until(last.opt);
set have;
by id opt;
do i=t_spent-3*3600 to t_spent ;
if h.check(key:id,key:opt,key:i)=0 then do;hour=hour(i);h_hour.ref();end;
end;
count_hours=h_hour.num_items;
output;
end;
h.clear();
h_hour.clear();
drop hour i;
run;
@Ksharp This is still several orders of magnitude slower than SQL.
Yeah. You are right . It depends on what OP want.
I know this was active days ago, but I just found a little problem.
Both versions give me (mostly) +1 hour that I needed. I know it's a very tricky to solve this because I need to count 3*60 minutes in past from the record and count how many records that combination has in first 60 min, in 120 min, and in 180 min in the past.
For example, If some combination has the record in 13:40 and 14: 05, for the record from 14:05 count need to be 1, not 2. Record 13:40 was in the past hour (13) but is just 25 minutes before, not the 60+.
Thank you both for your help!
The count method as given counts 14:05 as the first match and 13:40 as the second match for the record at 14:05, hence you have 2 macthes.
If you don't want to count 14:05, just subtract 1.
I'm not sure if you have understood the question, I'm sorry for not being so clear.
In how many hours (in past 3 hours, 180min/3) that combination has the record. Not how many matches(record) has in each hour.
Maybe someone has just the record in 14:03, 14:05, 14:19, 14:55, 15:02, and that need to be count=1. Because the range is <60min and it does not matter how many records have in that range. If the range is between 0min-60min and 60min-120min it is 2 (example 14:01, 14:03, 14:05, 14:19, 14:55, 15:02) , if it's between 0min-60min, 60min-120min and 120min-180min, that's count=3 (example 14:01, 14:03, 14:05, 14:19, 14:55, 15:02, 16:04), because have records in all the 3 past hours.
Range >180min I don't look at all, need to take observation for each record just the past 3 hours.
Something like this (can't test atm) should be close, please run and alter as needed
proc sql;
create table WANT as
select unique det.*
, sum(sum.T_SPENT*(3600*0<DIFF<=3600*1))>0 as INTERVAL1H
, sum(sum.T_SPENT*(3600*1<DIFF<=3600*2))>0 as INTERVAL2H
, sum(sum.T_SPENT*(3600*2<DIFF<=3600*3))>0 as INTERVAL3H
, sum(calculated INTERVAL1H,calculated INTERVAL2H,calculated INTERVAL3H) as COUNT_THAT_YOU_WANT
from HAVE det
,DIFF dif
,HAVE sum
where det.ID =sum.ID
and det.OPT =sum.OPT
and det.T_SPENT-DIFF=sum.T_SPENT
group by det.ID , det.OPT, det.T_SPENT
order by det.ID , det.OPT, det.T_SPENT ;
quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.