Re: Count of observations by id and date if there are 4 or more transa...

Stretlow · Posted 12-05-2016 08:05 AM

Hi there, sorry second question of the day today.

I have a large data set with numerous variables two of which are "ID" and "TransactionDate"

what I want to do is count the number of transactions in two day windows per id

Example.

ID TransactionDate

1 01/11/2016

1 02/11/2016

2 01/11/2016

2 07/11/2016

3 03/11/2016

3 04/11/2016

I only want to see ID 1 and 3 as they have 4 records in a two day window.

I've had a play around with proq freq and the dif and Lag functions but I just cant seem to work it out.

Thanks in advance.

Marc

RW9 · Posted 12-05-2016 08:15 AM

Well, step 1 is to identify records which conform to your "2 day window" criteria. You would need to flesh this out for me a bit, is it 2 days from earliest date, 2 days from previous etc.

Wonce you have that then a simple count() group by id <var>; would work.

Kurt_Bremser · Posted 12-05-2016 08:15 AM

data have;
input id transactiondate :ddmmyy10.;
format transactiondate ddmmyy10.;
cards;
1          01/11/2016
1          01/11/2016
1          02/11/2016
1          02/11/2016
2          01/11/2016
2          07/11/2016
3          03/11/2016
3          04/11/2016
3          04/11/2016
3          04/11/2016
;
run;

/* First accumulate per day */
proc sql;
create table int as
select id, transactiondate, count(*) as count
from have
group by id,transactiondate
;
quit;

/* Now compare with previous observation */
data want;
set int;
by id;
lastdate = lag(transactiondate);
format lastdate ddmmyy10.;
lastcount = lag(count);
if lastdate = transactiondate - 1
then do;
  count = count + lastcount;
  output;
end;
keep id transactiondate count lastdate;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Stretlow · Posted 12-05-2016 10:07 AM

@Kurt_Bremser wrote:

data have;
input id transactiondate :ddmmyy10.;
format transactiondate ddmmyy10.;
cards;
1          01/11/2016
1          01/11/2016
1          02/11/2016
1          02/11/2016
2          01/11/2016
2          07/11/2016
3          03/11/2016
3          04/11/2016
3          04/11/2016
3          04/11/2016
;
run;

/* First accumulate per day */
proc sql;
create table int as
select id, transactiondate, count(*) as count
from have
group by id,transactiondate
;
quit;

/* Now compare with previous observation */
data want;
set int;
by id;
lastdate = lag(transactiondate);
format lastdate ddmmyy10.;
lastcount = lag(count);
if lastdate = transactiondate - 1
then do;
  count = count + lastcount;
  output;
end;
keep id transactiondate count lastdate;
run;

Hi.

Thanks for this. the second part of the code there is what i was trying to do but i was a little out.

It looks to be working in the main however I have found an example where there are 4 observations on one day that are not in the last output dataset.

Do you know why this would be?

Thanks

Stret

Kurt_Bremser · Posted 12-05-2016 10:18 AM

I see. I am specifically checking for two days with data.

data want;
set int;
by id;
lastdate = lag(transactiondate);
format lastdate ddmmyy10.;
lastcount = lag(count);
if lastdate = transactiondate - 1
then do;
  count = count + lastcount;
  if count >= 4 then output;
end;
else do;
  if count >= 4 then output;
end;
keep id transactiondate count lastdate;
run;

You might get redundant entries if you have a sequence like 3-3-3 on three successive days.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 12-05-2016 10:19 AM

If your data are sorted by id/transaction date, then this will be efficient and simple:

notes:

The do loop contructs a count of transactions (_NTRANS) for a single day
Then a two-record count is calculated
Only if the prior record was one day preceding the current record, keep it
The first program, however, will output overlapping two-day windows (i.e. 3 consecutive dates). If you don't want that, use the second program

data want (drop=_:);
  do _ntrans=1 by 1 until (last.transactiondate);
     set have;
     by id transactiondate ;
  end;

  twoday_count=sum(_ntrans,lag(_ntrans));
  if dif(transactiondate)=1;
run;

data want2 (drop=_:);
  retain _last_windowdate;  format _last_windowdate yymmddn8.;
  do _ntrans=1 by 1 until (last.transactiondate);
     set have;
     by id transactiondate ;
      if first.id then _last_windowdate=.;
  end;
  twoday_count=sum(_ntrans,lag(_ntrans));
  if dif(transactiondate)=1  and (_last_windowdate^=transactiondate-2);
  _last_windowdate=transactiondate;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Stretlow · Posted 12-05-2016 10:59 AM

@mkeintz wrote:

If your data are sorted by id/transaction date, then this will be efficient and simple:

notes:

The do loop contructs a count of transactions (_NTRANS) for a single day

Then a two-record count is calculated

Only if the prior record was one day preceding the current record, keep it

The first program, however, will output overlapping two-day windows (i.e. 3 consecutive dates). If you don't want that, use the second program
data want (drop=_:);
  do _ntrans=1 by 1 until (last.transactiondate);
     set have;
     by id transactiondate ;
  end;

  twoday_count=sum(_ntrans,lag(_ntrans));
  if dif(transactiondate)=1;
run;
Hi.

This again seems to be working fine other than the anomoly i posted above.

I have one particular ID that has 8 Observations.

4 on the 18/11/2016

2 on the 29/11/2016

2 on the 30/11/2016

Its output the 29th and 30th however not the 4 on the 18th.

Thanks for the help

mkeintz · Posted 12-05-2016 01:10 PM

Oops sorry. I see you mean 4 trans WITHIN two days, even if only one of those days is present:

BTW, notice the "trick" here is to update the LAG/DIF queues only at the end of each day's set of transactions. Also, even though the "lag(_ntrans)" is inside an IFN function, the lag queue is updated regardless of whether the ifn test is true. Which is what you want - the lag queue is always updated even if its results are not returned by the IFN function.

data want;

do _ntrans=1 by 1 until(last.transactiondate);

set have;

by id transactiondate;

end;

total_count = sum(_ntrans,ifn(dif(transactiondate)=1 and dif(id)=0,lag(_ntrans),0));

if total_count>=4;

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Stretlow · Posted 12-06-2016 04:23 AM

@mkeintz wrote:

Oops sorry. I see you mean 4 trans WITHIN two days, even if only one of those days is present:

BTW, notice the "trick" here is to update the LAG/DIF queues only at the end of each day's set of transactions. Also, even though the "lag(_ntrans)" is inside an IFN function, the lag queue is updated regardless of whether the ifn test is true. Which is what you want - the lag queue is always updated even if its results are not returned by the IFN function.

data want;

  do _ntrans=1 by 1 until(last.transactiondate);

    set have;

    by id transactiondate;

end;

  total_count = sum(_ntrans,ifn(dif(transactiondate)=1 and dif(id)=0,lag(_ntrans),0));

if total_count>=4;

run;

This is making a bit more sense now and I think we're almost there with it.

heres a sample of the output I have:-

ID	TransactionDate	total_count
1510110787	18/11/2016	4
1510110787	30/11/2016	4
1510141720	26/11/2016	4
1510390852	21/11/2016	5
1510390852	22/11/2016	7
1510390852	28/11/2016	4
1510472816	20/11/2016	4

I think you touched on this earler in the thread about overlapping days.

For ID 1510390852 there are 5 observations on the 21/11/2016 and 2 on the 22/11/2016.

Instead of the output being 5 and 7 for the 21st and 22nd, as these are within a two day window would just like to see 1 observation with a value of 7.

Is this possible?

Obviously for the first Id in that table as the 18th and the 30th arent within a two day window the 2 obervations with a value of 4 are correct.

Your continued help is appreciated

Stret

mkeintz · Posted 12-06-2016 02:50 PM

You want

For any pair of consecutive days output total transactions if the total>=4. This just requires looking back one record.

For any single day not part of such a pair output total transactions if the total>=4. this will require looking ahead one record to see if the next record is a consecutive date. If not it's a singel day window.

Now what if you have 3 consecutive days each with 3 transactions? output the first pair and ignore the 3rd? If you have 4 consecutive days with 3 transactions each, presumably you output two non-overlapping windows each with 6 transactions. Is that correct?

data want (drop=_: nxt_:);

  do _ntrans=1 by 1 until(last.transactiondate);
    set have;
    by id transactiondate;
    if eoh=0 then set have 
      (firstobs=2 keep=id transactiondate rename=(id=nxt_id  transactiondate=nxt_td))
      end=eoh;
  end;

  _records_since_output+1;

  /* Only add the lagged _ntrans if previous rec is one day prior */
  total_count = sum(_ntrans,ifn(dif(id)=0 and dif(transactiondate)=1,lag(_ntrans),0));

  nxt_day_present=(id=nxt_id and transactiondate+1=nxt_td);

  if total_count>=4 and (nxt_day_present=0 or _records_since_output>=2);
  _records_since_output=0;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Count of observations by id and date if there are 4 or more transactions in a 2 day window

Re: Count of observations by id and date if there are 4 or more transactions in a 2 day window

Re: Count of observations by id and date if there are 4 or more transactions in a 2 day window

Re: Count of observations by id and date if there are 4 or more transactions in a 2 day window

Re: Count of observations by id and date if there are 4 or more transactions in a 2 day window

Re: Count of observations by id and date if there are 4 or more transactions in a 2 day window

Re: Count of observations by id and date if there are 4 or more transactions in a 2 day window

Re: Count of observations by id and date if there are 4 or more transactions in a 2 day window

Re: Count of observations by id and date if there are 4 or more transactions in a 2 day window

Re: Count of observations by id and date if there are 4 or more transactions in a 2 day window

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away