BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jovic92
Obsidian | Level 7

Hi everyone, I have one more problem. 

I can't get the count of unique(distinct) values to use in hash table calculation.

data have;
input id  opt  $   t_purchase   : datetime32. t_spent :datetime32.  usage ;
format  t_purchase  t_spent datetime32. ;
cards; 
1    a1  10NOV2017:12:02:00  10NOV2017:14:05:00    3  
1    a1  10NOV2017:12:02:00  10NOV2017:15:20:33    0 
1    a1  10NOV2017:12:02:00  10NOV2017:18:24:50    6  
1    a1  10NOV2017:12:02:00  10NOV2017:14:02:58    3 
1    a1  10NOV2017:12:02:00  10NOV2017:20:22:07   12  
1    a1  10NOV2017:12:02:00  10NOV2017:13:57:12    7  
2 b1 10NOV2017:12:55:24 10NOV2017:13:01:42 14
2 b1 10NOV2017:12:55:24 10NOV2017:14:42:25 2
2 b1 10NOV2017:12:55:24 10NOV2017:16:55:32 5
;
run;

 I need to find for every id, opt group (combination) in how many unique hours (column 't_spent') has the record (in past 3 hours). For example, if it's record from 10NOV2017:21:04:33, I need to count a distinct number of hours from 10NOV2017:18:04:33 to 10NOV2017:21:04:33 if id, opt combination has the record. Min what we can get is 1, max is 3. 

I have this code (mostly part is useful and helpful and it's written by @Ksharp, but the part I've changed is not working very well) where I tried to count distinct of column  'hourIy_time' additionally created  hourly_time = t_spent and formatted format hourly_time datetime11. so I could maybe use it to count distinct but this approach is not the best because it can easily catch 4 different hours. Also, this way I discovered that data step does not support DISTINCT function. 

 

data want;
 if _n_=1 then do;
  if 0 then set have(rename=(t_spent=_t_spent));
  declare hash h(dataset:'have(rename=(t_spent=_t_spent))',hashexp:20);
  h.definekey('id','opt', 't_spent');
  h.definedata('_usage');
  h.definedone();
 end;
set have;
count_hours=0;
do i=t_spent-3*3600 to t_spent ;
 if h.find(key:id,key:opt,key:i)=0 then count_hours+_t_spent;
end;
drop _usage i;
run;

Desired output with the desired column:

id   opt     t_purchase           t_spent        usage  count_hours
1 a1 10NOV2017:12:02:00 10NOV2017:14:05:00 3 1 1 a1 10NOV2017:12:02:00 10NOV2017:15:20:33 0 2 1 a1 10NOV2017:12:02:00 10NOV2017:18:24:50 6 1 1 a1 10NOV2017:12:02:00 10NOV2017:14:02:58 3 2 1 a1 10NOV2017:12:02:00 10NOV2017:20:22:07 12 2 1 a1 10NOV2017:12:02:00 10NOV2017:13:57:12 7 1
2 b1 10NOV2017:12:55:24 10NOV2017:13:01:42 14 1
2 b1 10NOV2017:12:55:24 10NOV2017:14:42:25 2 2
2 b1 10NOV2017:12:55:24 10NOV2017:16:55:32 5 3
2 b1 10NOV2017:12:55:24 10NOV2017:15:24:07 3 2

 

 Thanks in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Something like this (can't test atm) should be close, please run and alter as needed


proc sql;                
  create table WANT as
  select unique det.*
  , sum(sum.T_SPENT*(3600*0<DIFF<=3600*1))>0 as INTERVAL1H
  , sum(sum.T_SPENT*(3600*1<DIFF<=3600*2))>0 as INTERVAL2H
  , sum(sum.T_SPENT*(3600*2<DIFF<=3600*3))>0 as INTERVAL3H
  , sum(calculated INTERVAL1H,calculated INTERVAL2H,calculated INTERVAL3H) as COUNT_THAT_YOU_WANT
   from HAVE det
       ,DIFF dif
       ,HAVE sum
   where det.ID          =sum.ID 
     and det.OPT         =sum.OPT
     and det.T_SPENT-DIFF=sum.T_SPENT
  group by det.ID , det.OPT, det.T_SPENT  
  order by det.ID , det.OPT, det.T_SPENT ; 
quit;

 

 

View solution in original post

11 REPLIES 11
Kurt_Bremser
Super User

Wouldn' that simply be

proc sql;
create table want as
select id, opt, count(distinct(hour(t_spent))) as hour_count
from have
group by id, opt;
quit;

?

Ksharp
Super User

OK. Try this one.

 


data have;
input id  opt  $   t_purchase   : datetime32. t_spent :datetime32.  usage ;
format  t_purchase  t_spent datetime32. ;
cards; 
1    a1  10NOV2017:12:02:00  10NOV2017:14:05:00    3  
1    a1  10NOV2017:12:02:00  10NOV2017:15:20:33    0 
1    a1  10NOV2017:12:02:00  10NOV2017:18:24:50    6  
1    a1  10NOV2017:12:02:00  10NOV2017:14:02:58    3 
1    a1  10NOV2017:12:02:00  10NOV2017:20:22:07   12  
1    a1  10NOV2017:12:02:00  10NOV2017:13:57:12    7  
2    b1  10NOV2017:12:55:24  10NOV2017:13:01:42   14  
2    b1  10NOV2017:12:55:24  10NOV2017:14:42:25    2 
2    b1  10NOV2017:12:55:24  10NOV2017:16:55:32    5 
;
run;

data want;
 if _n_=1 then do;
  if 0 then set have;
  declare hash h(dataset:'have',hashexp:20);
  h.definekey('id','opt', 't_spent');
  h.definedone();

  declare hash h_hour();
  h_hour.definekey('hour');
  h_hour.definedone();
 end;
set have;
do i=t_spent-3*3600 to t_spent ;
 if h.check(key:id,key:opt,key:i)=0 then do;hour=hour(i);h_hour.ref();end;
end;
count_hours=h_hour.num_items;
h_hour.clear();
drop hour i;
run;
jovic92
Obsidian | Level 7

Thank you very much. This works just as I needed.

I tasted only on small data, when I tried on 5.5m rows, it's working by now about 6 hours. Is it expected? I thought about splitting into more tables, but that would mean that I need to join them after and that's pretty much the same. 

ChrisNZ
Tourmaline | Level 20

This runs in 30 seconds for 10 million records.

 

 


data HAVE;
  input ID  OPT  $   T_PURCHASE   : datetime32. T_SPENT :datetime32.  USAGE ;
  format  T_PURCHASE  T_SPENT datetime32. ;
  do ID=1 to 1e6; output; end;
cards; 
1    a1  10NOV2017:12:02:00  10NOV2017:14:05:00    3  
1    a1  10NOV2017:12:02:00  10NOV2017:15:20:33    0 
1    a1  10NOV2017:12:02:00  10NOV2017:18:24:50    6  
1    a1  10NOV2017:12:02:00  10NOV2017:14:02:58    3 
1    a1  10NOV2017:12:02:00  10NOV2017:20:22:07   12  
1    a1  10NOV2017:12:02:00  10NOV2017:13:57:12    7  
2    b1  10NOV2017:12:55:24  10NOV2017:13:01:42   14  
2    b1  10NOV2017:12:55:24  10NOV2017:14:42:25    2 
2    b1  10NOV2017:12:55:24  10NOV2017:16:55:32    5 
run;

data DIFF;
  do DIFF=0 to 3600*3;
    output; 
  end; 
run;

proc sql;                
  create table WANT as
  select unique det.*, count(sum.T_SPENT) as COUNT
   from HAVE det
       ,DIFF dif
       ,HAVE sum
   where det.ID          =sum.ID 
     and det.OPT         =sum.OPT
     and det.T_SPENT-DIFF=sum.T_SPENT
  group by det.ID , det.OPT, det.T_SPENT  
  order by det.ID , det.OPT, det.T_SPENT ; 
quit;

  

 ID  OPT  T_SPENT  COUNT
1 a1 10NOV2017:13:57:12 1
1 a1 10NOV2017:14:02:58 2
1 a1 10NOV2017:14:05:00 3
1 a1 10NOV2017:15:20:33 4
1 a1 10NOV2017:18:24:50 1
1 a1 10NOV2017:20:22:07 2
1 b1 10NOV2017:13:01:42 1

  

The key to speed is to use equijoins, hence the intermediate table.

 

 

 

Ksharp
Super User

You could fast it if sorting the table before .

 


data have;
input id  opt  $   t_purchase   : datetime32. t_spent :datetime32.  usage ;
format  t_purchase  t_spent datetime32. ;
cards; 
1    a1  10NOV2017:12:02:00  10NOV2017:14:05:00    3  
1    a1  10NOV2017:12:02:00  10NOV2017:15:20:33    0 
1    a1  10NOV2017:12:02:00  10NOV2017:18:24:50    6  
1    a1  10NOV2017:12:02:00  10NOV2017:14:02:58    3 
1    a1  10NOV2017:12:02:00  10NOV2017:20:22:07   12  
1    a1  10NOV2017:12:02:00  10NOV2017:13:57:12    7  
2    b1  10NOV2017:12:55:24  10NOV2017:13:01:42   14  
2    b1  10NOV2017:12:55:24  10NOV2017:14:42:25    2 
2    b1  10NOV2017:12:55:24  10NOV2017:16:55:32    5 
;
run;
proc sort data=have;
 by id  opt;
run;
data want;
 if _n_=1 then do;
  if 0 then set have;
  declare hash h();
  h.definekey('id','opt', 't_spent');
  h.definedone();

  declare hash h_hour();
  h_hour.definekey('hour');
  h_hour.definedone();
 end;

do until(last.opt);
 set have;
 by id opt;
 h.ref();
end;

do until(last.opt);
 set have;
 by id opt;
 do i=t_spent-3*3600 to t_spent ;
   if h.check(key:id,key:opt,key:i)=0 then do;hour=hour(i);h_hour.ref();end;
 end;
 count_hours=h_hour.num_items;
 output;
end;
h.clear();
h_hour.clear();
drop hour i;
run;
ChrisNZ
Tourmaline | Level 20

@Ksharp This is still several orders of magnitude slower than SQL.

Ksharp
Super User

@ChrisNZ

Yeah. You are right . It depends on what OP want.

jovic92
Obsidian | Level 7

@ChrisNZ @Ksharp 

I know this was active days ago, but I just found a little problem. 

Both versions give me (mostly) +1 hour that I needed. I know it's a very tricky to solve this because I need to count 3*60 minutes in past from the record and count how many records that combination has in first 60 min, in 120 min, and in 180 min in the past. 

For example, If some combination has the record in 13:40 and 14: 05, for the record from 14:05 count need to be 1, not 2. Record 13:40 was in the past hour (13) but is just 25 minutes before, not the 60+. 

 

Thank you both for your help! 

ChrisNZ
Tourmaline | Level 20

The count method as given counts 14:05 as the first match and 13:40 as the second match for the record at 14:05, hence you have 2 macthes.

If you don't want to count 14:05, just subtract 1.

jovic92
Obsidian | Level 7

I'm not sure if you have understood the question, I'm sorry for not being so clear.  

In how many hours (in past 3 hours, 180min/3) that combination has the record. Not how many matches(record) has in each hour. 

Maybe someone has just the record in 14:03, 14:05, 14:19, 14:55, 15:02, and that need to be count=1. Because the range is <60min and it does not matter how many records have in that range. If the range is between 0min-60min and 60min-120min it is 2 (example 14:01, 14:03, 14:05, 14:19, 14:55, 15:02) , if it's between 0min-60min, 60min-120min and 120min-180min, that's count=3  (example 14:01, 14:03, 14:05, 14:19, 14:55, 15:02, 16:04), because have records in all the 3 past hours.

Range >180min I don't look at all, need to take observation for each record just the past 3 hours. 

ChrisNZ
Tourmaline | Level 20

Something like this (can't test atm) should be close, please run and alter as needed


proc sql;                
  create table WANT as
  select unique det.*
  , sum(sum.T_SPENT*(3600*0<DIFF<=3600*1))>0 as INTERVAL1H
  , sum(sum.T_SPENT*(3600*1<DIFF<=3600*2))>0 as INTERVAL2H
  , sum(sum.T_SPENT*(3600*2<DIFF<=3600*3))>0 as INTERVAL3H
  , sum(calculated INTERVAL1H,calculated INTERVAL2H,calculated INTERVAL3H) as COUNT_THAT_YOU_WANT
   from HAVE det
       ,DIFF dif
       ,HAVE sum
   where det.ID          =sum.ID 
     and det.OPT         =sum.OPT
     and det.T_SPENT-DIFF=sum.T_SPENT
  group by det.ID , det.OPT, det.T_SPENT  
  order by det.ID , det.OPT, det.T_SPENT ; 
quit;

 

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 2150 views
  • 6 likes
  • 4 in conversation