BookmarkSubscribeRSS Feed
Stretlow
Obsidian | Level 7

Hi there, sorry second question of the day today.

 

I have a large data set with numerous variables two of which are "ID" and "TransactionDate"

 

what I want to do is count the number of transactions in two day windows per id

 

Example.

 

ID        TransactionDate     

1          01/11/2016              

1          01/11/2016

1          02/11/2016

1          02/11/2016

2          01/11/2016

2          07/11/2016

3          03/11/2016

3          04/11/2016

3          04/11/2016

3          04/11/2016

 

I only want to see ID 1 and 3 as they have 4 records in a two day window. 

 

I've had a play around with proq freq and the dif and Lag functions but I just cant seem to work it out.

 

Thanks in advance.

 

Marc

 

 

 

9 REPLIES 9
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, step 1 is to identify records which conform to your "2 day window" criteria.  You would need to flesh this out for me a bit, is it 2 days from earliest date, 2 days from previous etc.

 

Wonce you have that then a simple count() group by id <var>; would work.

Kurt_Bremser
Super User
data have;
input id transactiondate :ddmmyy10.;
format transactiondate ddmmyy10.;
cards;
1          01/11/2016
1          01/11/2016
1          02/11/2016
1          02/11/2016
2          01/11/2016
2          07/11/2016
3          03/11/2016
3          04/11/2016
3          04/11/2016
3          04/11/2016
;
run;

/* First accumulate per day */
proc sql;
create table int as
select id, transactiondate, count(*) as count
from have
group by id,transactiondate
;
quit;

/* Now compare with previous observation */
data want;
set int;
by id;
lastdate = lag(transactiondate);
format lastdate ddmmyy10.;
lastcount = lag(count);
if lastdate = transactiondate - 1
then do;
  count = count + lastcount;
  output;
end;
keep id transactiondate count lastdate;
run;
Stretlow
Obsidian | Level 7

@Kurt_Bremser wrote:
data have;
input id transactiondate :ddmmyy10.;
format transactiondate ddmmyy10.;
cards;
1          01/11/2016
1          01/11/2016
1          02/11/2016
1          02/11/2016
2          01/11/2016
2          07/11/2016
3          03/11/2016
3          04/11/2016
3          04/11/2016
3          04/11/2016
;
run;

/* First accumulate per day */
proc sql;
create table int as
select id, transactiondate, count(*) as count
from have
group by id,transactiondate
;
quit;

/* Now compare with previous observation */
data want;
set int;
by id;
lastdate = lag(transactiondate);
format lastdate ddmmyy10.;
lastcount = lag(count);
if lastdate = transactiondate - 1
then do;
  count = count + lastcount;
  output;
end;
keep id transactiondate count lastdate;
run;

 

Hi.

 

Thanks for this. the second part of the code there is what i was trying to do but i was a little out.

 

It looks to be working in the main however I have found an example where there are 4 observations on one day that are not in the last output dataset.

 

Do you know why this would be?

 

Thanks

 

Stret

Kurt_Bremser
Super User

I see. I am specifically checking for two days with data.

data want;
set int;
by id;
lastdate = lag(transactiondate);
format lastdate ddmmyy10.;
lastcount = lag(count);
if lastdate = transactiondate - 1
then do;
  count = count + lastcount;
  if count >= 4 then output;
end;
else do;
  if count >= 4 then output;
end;
keep id transactiondate count lastdate;
run;

You might get redundant entries if you have a sequence like 3-3-3 on three successive days.

mkeintz
PROC Star

If your data are sorted by id/transaction date, then this will be efficient and simple:

 

notes:

  1. The do loop contructs a count of transactions (_NTRANS) for a single day
  2. Then a two-record count is calculated
  3. Only if the prior record was one day preceding the current record, keep it
  4. The first program, however, will output overlapping two-day windows (i.e. 3 consecutive dates).  If you don't  want that, use the second program
data want (drop=_:);
  do _ntrans=1 by 1 until (last.transactiondate);
     set have;
     by id transactiondate ;
  end;

  twoday_count=sum(_ntrans,lag(_ntrans));
  if dif(transactiondate)=1;
run;

 

 

 

data want2 (drop=_:);
  retain _last_windowdate;  format _last_windowdate yymmddn8.;
  do _ntrans=1 by 1 until (last.transactiondate);
     set have;
     by id transactiondate ;
      if first.id then _last_windowdate=.;
  end;
  twoday_count=sum(_ntrans,lag(_ntrans));
  if dif(transactiondate)=1  and (_last_windowdate^=transactiondate-2);
  _last_windowdate=transactiondate;
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Stretlow
Obsidian | Level 7

@mkeintz wrote:

If your data are sorted by id/transaction date, then this will be efficient and simple:

 

notes:

  1. The do loop contructs a count of transactions (_NTRANS) for a single day
  2. Then a two-record count is calculated
  3. Only if the prior record was one day preceding the current record, keep it
  4. The first program, however, will output overlapping two-day windows (i.e. 3 consecutive dates).  If you don't  want that, use the second program
data want (drop=_:);
  do _ntrans=1 by 1 until (last.transactiondate);
     set have;
     by id transactiondate ;
  end;

  twoday_count=sum(_ntrans,lag(_ntrans));
  if dif(transactiondate)=1;
run;

 

 Hi.

 

This again seems to be working fine other than the anomoly i posted above.

 

I have one particular ID that has 8 Observations.

 

4 on the 18/11/2016

2 on the 29/11/2016

2 on the 30/11/2016

 

Its output the 29th and 30th however not the 4 on the 18th.

 

Thanks for the help

mkeintz
PROC Star

Oops sorry.  I see you mean 4 trans  WITHIN two days, even if only one of those days is present:

 

BTW, notice the "trick" here is to update the LAG/DIF queues only at the end of each day's set of transactions.  Also, even though the "lag(_ntrans)" is inside an IFN function, the lag queue is updated regardless of whether the ifn test is true.  Which is what you want - the lag queue is always updated even if its results are not returned by the IFN function.

 

data want;

  do _ntrans=1 by 1 until(last.transactiondate);

    set have;

    by id transactiondate;

  end;

  total_count = sum(_ntrans,ifn(dif(transactiondate)=1 and dif(id)=0,lag(_ntrans),0));

  if total_count>=4;

run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Stretlow
Obsidian | Level 7

@mkeintz wrote:

Oops sorry.  I see you mean 4 trans  WITHIN two days, even if only one of those days is present:

 

BTW, notice the "trick" here is to update the LAG/DIF queues only at the end of each day's set of transactions.  Also, even though the "lag(_ntrans)" is inside an IFN function, the lag queue is updated regardless of whether the ifn test is true.  Which is what you want - the lag queue is always updated even if its results are not returned by the IFN function.

 

data want;

  do _ntrans=1 by 1 until(last.transactiondate);

    set have;

    by id transactiondate;

  end;

  total_count = sum(_ntrans,ifn(dif(transactiondate)=1 and dif(id)=0,lag(_ntrans),0));

  if total_count>=4;

run;

 


This is making a bit more sense now and I think we're almost there with it.

heres a sample of the output I have:-

 

ID TransactionDate total_count
1510110787 18/11/2016 4
1510110787 30/11/2016 4
1510141720 26/11/2016 4
1510390852 21/11/2016 5
1510390852 22/11/2016 7
1510390852 28/11/2016 4
1510472816 20/11/2016 4

 

 

I think you touched on this earler in the thread about overlapping days. 

 

For ID 1510390852 there are 5 observations on the 21/11/2016 and 2 on the 22/11/2016.

 

Instead of the output being 5 and 7 for the 21st and 22nd, as these are within a two day window would just like to see 1 observation with a value of 7.

 

Is this possible?

 

Obviously for the first Id in that table as the 18th and the 30th arent within a two day window the 2 obervations with a value of 4 are correct.

 

Your continued help is appreciated

 

Stret

 

 

mkeintz
PROC Star

You want

 

For any pair of consecutive days output total transactions if the total>=4.  This just requires looking back one record.

 

For any single day not part of such a pair output total transactions if the total>=4.  this will require looking ahead one record to see if the next record is a consecutive date.  If not it's a singel day window.

 

Now what if you have 3 consecutive  days each with 3 transactions?   output the first pair and ignore the 3rd?  If you have 4 consecutive days with  3 transactions each, presumably you output two non-overlapping windows each with 6 transactions. Is that correct?

 

 

 

data want (drop=_: nxt_:);

  do _ntrans=1 by 1 until(last.transactiondate);
    set have;
    by id transactiondate;
    if eoh=0 then set have 
      (firstobs=2 keep=id transactiondate rename=(id=nxt_id  transactiondate=nxt_td))
      end=eoh;
  end;

  _records_since_output+1;

  /* Only add the lagged _ntrans if previous rec is one day prior */
  total_count = sum(_ntrans,ifn(dif(id)=0 and dif(transactiondate)=1,lag(_ntrans),0));

  nxt_day_present=(id=nxt_id and transactiondate+1=nxt_td);

  if total_count>=4 and (nxt_day_present=0 or _records_since_output>=2);
  _records_since_output=0;
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 7159 views
  • 0 likes
  • 4 in conversation