BookmarkSubscribeRSS Feed
Stretlow
Obsidian | Level 7

Hi there, sorry second question of the day today.

 

I have a large data set with numerous variables two of which are "ID" and "TransactionDate"

 

what I want to do is count the number of transactions in two day windows per id

 

Example.

 

ID        TransactionDate     

1          01/11/2016              

1          01/11/2016

1          02/11/2016

1          02/11/2016

2          01/11/2016

2          07/11/2016

3          03/11/2016

3          04/11/2016

3          04/11/2016

3          04/11/2016

 

I only want to see ID 1 and 3 as they have 4 records in a two day window. 

 

I've had a play around with proq freq and the dif and Lag functions but I just cant seem to work it out.

 

Thanks in advance.

 

Marc

 

 

 

9 REPLIES 9
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, step 1 is to identify records which conform to your "2 day window" criteria.  You would need to flesh this out for me a bit, is it 2 days from earliest date, 2 days from previous etc.

 

Wonce you have that then a simple count() group by id <var>; would work.

Kurt_Bremser
Super User
data have;
input id transactiondate :ddmmyy10.;
format transactiondate ddmmyy10.;
cards;
1          01/11/2016
1          01/11/2016
1          02/11/2016
1          02/11/2016
2          01/11/2016
2          07/11/2016
3          03/11/2016
3          04/11/2016
3          04/11/2016
3          04/11/2016
;
run;

/* First accumulate per day */
proc sql;
create table int as
select id, transactiondate, count(*) as count
from have
group by id,transactiondate
;
quit;

/* Now compare with previous observation */
data want;
set int;
by id;
lastdate = lag(transactiondate);
format lastdate ddmmyy10.;
lastcount = lag(count);
if lastdate = transactiondate - 1
then do;
  count = count + lastcount;
  output;
end;
keep id transactiondate count lastdate;
run;
Stretlow
Obsidian | Level 7

@Kurt_Bremser wrote:
data have;
input id transactiondate :ddmmyy10.;
format transactiondate ddmmyy10.;
cards;
1          01/11/2016
1          01/11/2016
1          02/11/2016
1          02/11/2016
2          01/11/2016
2          07/11/2016
3          03/11/2016
3          04/11/2016
3          04/11/2016
3          04/11/2016
;
run;

/* First accumulate per day */
proc sql;
create table int as
select id, transactiondate, count(*) as count
from have
group by id,transactiondate
;
quit;

/* Now compare with previous observation */
data want;
set int;
by id;
lastdate = lag(transactiondate);
format lastdate ddmmyy10.;
lastcount = lag(count);
if lastdate = transactiondate - 1
then do;
  count = count + lastcount;
  output;
end;
keep id transactiondate count lastdate;
run;

 

Hi.

 

Thanks for this. the second part of the code there is what i was trying to do but i was a little out.

 

It looks to be working in the main however I have found an example where there are 4 observations on one day that are not in the last output dataset.

 

Do you know why this would be?

 

Thanks

 

Stret

Kurt_Bremser
Super User

I see. I am specifically checking for two days with data.

data want;
set int;
by id;
lastdate = lag(transactiondate);
format lastdate ddmmyy10.;
lastcount = lag(count);
if lastdate = transactiondate - 1
then do;
  count = count + lastcount;
  if count >= 4 then output;
end;
else do;
  if count >= 4 then output;
end;
keep id transactiondate count lastdate;
run;

You might get redundant entries if you have a sequence like 3-3-3 on three successive days.

mkeintz
PROC Star

If your data are sorted by id/transaction date, then this will be efficient and simple:

 

notes:

  1. The do loop contructs a count of transactions (_NTRANS) for a single day
  2. Then a two-record count is calculated
  3. Only if the prior record was one day preceding the current record, keep it
  4. The first program, however, will output overlapping two-day windows (i.e. 3 consecutive dates).  If you don't  want that, use the second program
data want (drop=_:);
  do _ntrans=1 by 1 until (last.transactiondate);
     set have;
     by id transactiondate ;
  end;

  twoday_count=sum(_ntrans,lag(_ntrans));
  if dif(transactiondate)=1;
run;

 

 

 

data want2 (drop=_:);
  retain _last_windowdate;  format _last_windowdate yymmddn8.;
  do _ntrans=1 by 1 until (last.transactiondate);
     set have;
     by id transactiondate ;
      if first.id then _last_windowdate=.;
  end;
  twoday_count=sum(_ntrans,lag(_ntrans));
  if dif(transactiondate)=1  and (_last_windowdate^=transactiondate-2);
  _last_windowdate=transactiondate;
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Stretlow
Obsidian | Level 7

@mkeintz wrote:

If your data are sorted by id/transaction date, then this will be efficient and simple:

 

notes:

  1. The do loop contructs a count of transactions (_NTRANS) for a single day
  2. Then a two-record count is calculated
  3. Only if the prior record was one day preceding the current record, keep it
  4. The first program, however, will output overlapping two-day windows (i.e. 3 consecutive dates).  If you don't  want that, use the second program
data want (drop=_:);
  do _ntrans=1 by 1 until (last.transactiondate);
     set have;
     by id transactiondate ;
  end;

  twoday_count=sum(_ntrans,lag(_ntrans));
  if dif(transactiondate)=1;
run;

 

 Hi.

 

This again seems to be working fine other than the anomoly i posted above.

 

I have one particular ID that has 8 Observations.

 

4 on the 18/11/2016

2 on the 29/11/2016

2 on the 30/11/2016

 

Its output the 29th and 30th however not the 4 on the 18th.

 

Thanks for the help

mkeintz
PROC Star

Oops sorry.  I see you mean 4 trans  WITHIN two days, even if only one of those days is present:

 

BTW, notice the "trick" here is to update the LAG/DIF queues only at the end of each day's set of transactions.  Also, even though the "lag(_ntrans)" is inside an IFN function, the lag queue is updated regardless of whether the ifn test is true.  Which is what you want - the lag queue is always updated even if its results are not returned by the IFN function.

 

data want;

  do _ntrans=1 by 1 until(last.transactiondate);

    set have;

    by id transactiondate;

  end;

  total_count = sum(_ntrans,ifn(dif(transactiondate)=1 and dif(id)=0,lag(_ntrans),0));

  if total_count>=4;

run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Stretlow
Obsidian | Level 7

@mkeintz wrote:

Oops sorry.  I see you mean 4 trans  WITHIN two days, even if only one of those days is present:

 

BTW, notice the "trick" here is to update the LAG/DIF queues only at the end of each day's set of transactions.  Also, even though the "lag(_ntrans)" is inside an IFN function, the lag queue is updated regardless of whether the ifn test is true.  Which is what you want - the lag queue is always updated even if its results are not returned by the IFN function.

 

data want;

  do _ntrans=1 by 1 until(last.transactiondate);

    set have;

    by id transactiondate;

  end;

  total_count = sum(_ntrans,ifn(dif(transactiondate)=1 and dif(id)=0,lag(_ntrans),0));

  if total_count>=4;

run;

 


This is making a bit more sense now and I think we're almost there with it.

heres a sample of the output I have:-

 

ID TransactionDate total_count
1510110787 18/11/2016 4
1510110787 30/11/2016 4
1510141720 26/11/2016 4
1510390852 21/11/2016 5
1510390852 22/11/2016 7
1510390852 28/11/2016 4
1510472816 20/11/2016 4

 

 

I think you touched on this earler in the thread about overlapping days. 

 

For ID 1510390852 there are 5 observations on the 21/11/2016 and 2 on the 22/11/2016.

 

Instead of the output being 5 and 7 for the 21st and 22nd, as these are within a two day window would just like to see 1 observation with a value of 7.

 

Is this possible?

 

Obviously for the first Id in that table as the 18th and the 30th arent within a two day window the 2 obervations with a value of 4 are correct.

 

Your continued help is appreciated

 

Stret

 

 

mkeintz
PROC Star

You want

 

For any pair of consecutive days output total transactions if the total>=4.  This just requires looking back one record.

 

For any single day not part of such a pair output total transactions if the total>=4.  this will require looking ahead one record to see if the next record is a consecutive date.  If not it's a singel day window.

 

Now what if you have 3 consecutive  days each with 3 transactions?   output the first pair and ignore the 3rd?  If you have 4 consecutive days with  3 transactions each, presumably you output two non-overlapping windows each with 6 transactions. Is that correct?

 

 

 

data want (drop=_: nxt_:);

  do _ntrans=1 by 1 until(last.transactiondate);
    set have;
    by id transactiondate;
    if eoh=0 then set have 
      (firstobs=2 keep=id transactiondate rename=(id=nxt_id  transactiondate=nxt_td))
      end=eoh;
  end;

  _records_since_output+1;

  /* Only add the lagged _ntrans if previous rec is one day prior */
  total_count = sum(_ntrans,ifn(dif(id)=0 and dif(transactiondate)=1,lag(_ntrans),0));

  nxt_day_present=(id=nxt_id and transactiondate+1=nxt_td);

  if total_count>=4 and (nxt_day_present=0 or _records_since_output>=2);
  _records_since_output=0;
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 6499 views
  • 0 likes
  • 4 in conversation