Solved: Re: How can I find unique values within X consecutive days

Zatere · Posted 07-16-2023 01:57 PM

Hello,

I would ask for your assistance on the following.

I have data in the following format:

data have;
input ID DT :ddmmyy10. PRODUCT$ AMOUNT;
format DT ddmmyy10.;
datalines;
1 04/02/2023 A 100
1 15/02/2023 B 55
1 01/03/2023 B 65
1 07/03/2023 A 89
1 10/03/2023 B 75
1 24/03/2023 A 300
2 15/02/2023 C 400
2 25/02/2023 D 56
;
run;

I would like to be able to know for each ID how many unique products within a consecutive 10 days period.

The output should be like the below:

data want;
input ID DT :ddmmyy10. PRODUCT$ AMOUNT uniq_products;
format DT ddmmyy10.;
datalines;
1 05/02/2023 A 100 2
1 15/02/2023 B 55 1
1 01/03/2023 B 65 2
1 07/03/2023 A 89 2
1 10/03/2023 B 75 1
1 24/03/2023 A 300 1
2 15/02/2023 C 400 2
2 25/02/2023 D 56 1
;
run;

ID	DT	PRODUCT	AMOUNT	uniq_products
1	05/02/2023	A	100	2
1	15/02/2023	B	55	1
1	01/03/2023	B	65	2
1	07/03/2023	A	89	2
1	10/03/2023	B	75	1
1	24/03/2023	A	300	1
2	15/02/2023	C	400	2
2	25/02/2023	D	56	1

Any help please?

Thanks in advance.

mkeintz · Posted 07-16-2023 04:00 PM

@Zatere wrote:

Hi thank you for your reply.

Sorry for the confusion; an interval (or period) of X days makes more sense.

This solution is not the best in large volume of data.

If the SQL solution "is not the best in large volume of data" you can run a DATA step that reduces memory needs (and possibly time) dramatically by relying on the fact that the data are sorted by ID. It uses hash objects that are cleared for re-use each time a new ID is encountered:

data have;
  input ID DT :ddmmyy10. PRODUCT$ AMOUNT;
  format DT ddmmyy10.;
  format DT date9. ;
datalines;
1 05/02/2023 A 100
1 15/02/2023 B 55
1 01/03/2023 B 65
1 07/03/2023 A 89
1 10/03/2023 B 75
1 24/03/2023 A 300
2 15/02/2023 C 400
2 25/02/2023 D 56
run;


data want (drop=_:);
  set have (in=firstpass)   have (in=secondpass) ;
  by id;

  if _n_=1 then do;
    _d=.;   /* Use this for date index */
    declare hash prod_by_date ();
      prod_by_date.definekey('_d','product');
      prod_by_date.definedone();
    declare hash date_census();
      date_census.definekey('_d');
      date_census.definedata('_d','uniq_product');
      date_census.definedone();
  end;

  if first.id then do;
    _rc=prod_by_date.clear();
    _rc=date_census.clear();
  end;

  if firstpass then do _d=dt-10 to dt;
    if prod_by_date.find()^=0 then do;      
      prod_by_date.add();
      if date_census.find()=0 then uniq_product=uniq_product+1;
      else uniq_product=1;
      date_census.replace();
    end;
  end;

  if secondpass;
  date_census.find(key:dt);
run;

Editted addition:

@Tom pointed out that this code is not entirely clear. Here's an explanation that I think will help:

Each ID is processed twice, once to build up unique counts for each date, and the second time to retrieve thouse counts.

During the first pass of each ID, for each incoming DT, this program updates the prod_by_date object for every date in the 11-day window _D= DT-10 through DT. Given the time gaps, most of those _D values don’t show up as observed dates in the incoming record – but it doesn’t matter - this is just building a calendar.

If a date _D in the window does not have the product in hand, then (1) a dataitem (think “a row”) keyed on product and _D is added to the prod_by_date object, and (2) the unique count for _D is incremented in object DATE_CENSUS, keyed only on _D. So every date in DATE_CENSUS could belong to as many as 11 windows. It is this object that is retrieved during the second pass to get unique counts.

During the second pass I could have:

  _d=dt;
  date_census.find();

But this does the same thing (by substituting DT for the default hash key variable _D).

  date_census.find(key:dt);

I've submitted a better explanation of this code avoids double counting if a product shows up more than once within a ten-day window. It's in a separate note in this thread.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 07-16-2023 03:13 PM

So for each encountered date (for a given ID), you want a count of unique products dated up to ten days prior through ten days later, yes? I.e. a date span of 21 days.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Zatere · Posted 07-16-2023 03:50 PM

Hi thank you for your reply. Not exactly. For each given ID and DT (you may say for each row) I want to count the unique products for the next 10 days (10 days is just an example; it could also be for the next 15 days).

Taking ID = 1 as an example:

for the row 1 and for the next 10 days this ID has 2 unique products
for the row 2 and for the next 10 days this ID has 1 unique producy
and so on

PaigeMiller · Posted 07-16-2023 03:19 PM

Does the 10 day period start on the date in the data set (variable named DT)? Or some other start point?

--
Paige Miller

Zatere · Posted 07-16-2023 03:51 PM

Hi thank you for your reply.

Yes, the 10 day period starts on the date in the data set (variable named DT) and counts unique products for the next 10 days.

Tom · Posted 07-16-2023 03:42 PM

You clearly are not look for CONSECUTIVE days since there are gaps between all of the dates in your source dataset. So perhaps you meant an interval (or period) of X days. So what is the starting date of the interval? What is the ending date?

If I assume that the first date in the HAVE dataset should be the 5th of the month so that it matches the WANT dataset then using the 11 day period that starts of the listed DT and includes the next 10 days seems to generate the same result as you wanted.

In general the easiest way to code something like that it to to a self-join in SQL code.

proc sql;
create table want as 
  select distinct a.*,count(distinct b.product) as uniq_products 
  from have a 
  left join have b
  on a.id=b.id and b.dt between a.dt and a.dt+10
  group by a.id,a.dt
;
quit;

Result

PS Displaying dates in either DMY or MDY order will confuse half of your audience. Use YMD order or DATE format instead.

Zatere · Posted 07-16-2023 03:54 PM

Hi thank you for your reply.

Sorry for the confusion; an interval (or period) of X days makes more sense.

This solution is not the best in large volume of data.

mkeintz · Posted 07-16-2023 04:00 PM

@Zatere wrote:

Hi thank you for your reply.

Sorry for the confusion; an interval (or period) of X days makes more sense.

This solution is not the best in large volume of data.

If the SQL solution "is not the best in large volume of data" you can run a DATA step that reduces memory needs (and possibly time) dramatically by relying on the fact that the data are sorted by ID. It uses hash objects that are cleared for re-use each time a new ID is encountered:

data have;
  input ID DT :ddmmyy10. PRODUCT$ AMOUNT;
  format DT ddmmyy10.;
  format DT date9. ;
datalines;
1 05/02/2023 A 100
1 15/02/2023 B 55
1 01/03/2023 B 65
1 07/03/2023 A 89
1 10/03/2023 B 75
1 24/03/2023 A 300
2 15/02/2023 C 400
2 25/02/2023 D 56
run;


data want (drop=_:);
  set have (in=firstpass)   have (in=secondpass) ;
  by id;

  if _n_=1 then do;
    _d=.;   /* Use this for date index */
    declare hash prod_by_date ();
      prod_by_date.definekey('_d','product');
      prod_by_date.definedone();
    declare hash date_census();
      date_census.definekey('_d');
      date_census.definedata('_d','uniq_product');
      date_census.definedone();
  end;

  if first.id then do;
    _rc=prod_by_date.clear();
    _rc=date_census.clear();
  end;

  if firstpass then do _d=dt-10 to dt;
    if prod_by_date.find()^=0 then do;      
      prod_by_date.add();
      if date_census.find()=0 then uniq_product=uniq_product+1;
      else uniq_product=1;
      date_census.replace();
    end;
  end;

  if secondpass;
  date_census.find(key:dt);
run;

Editted addition:

@Tom pointed out that this code is not entirely clear. Here's an explanation that I think will help:

Each ID is processed twice, once to build up unique counts for each date, and the second time to retrieve thouse counts.

During the first pass of each ID, for each incoming DT, this program updates the prod_by_date object for every date in the 11-day window _D= DT-10 through DT. Given the time gaps, most of those _D values don’t show up as observed dates in the incoming record – but it doesn’t matter - this is just building a calendar.

If a date _D in the window does not have the product in hand, then (1) a dataitem (think “a row”) keyed on product and _D is added to the prod_by_date object, and (2) the unique count for _D is incremented in object DATE_CENSUS, keyed only on _D. So every date in DATE_CENSUS could belong to as many as 11 windows. It is this object that is retrieved during the second pass to get unique counts.

During the second pass I could have:

  _d=dt;
  date_census.find();

But this does the same thing (by substituting DT for the default hash key variable _D).

  date_census.find(key:dt);

I've submitted a better explanation of this code avoids double counting if a product shows up more than once within a ten-day window. It's in a separate note in this thread.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Zatere · Posted 07-16-2023 04:49 PM

Amazing

Tom · Posted 07-16-2023 05:09 PM

It is not clear how that finds unique products within the date range.

Tom · Posted 07-16-2023 11:02 PM

Still does not make sense to me.

How is it not going to count A and B twice in a series like:

date product
1 A
2 B
3 A
4 B
5 C

There are 3 distinct products there. But on each date there is one distinct product. If you add up five ones you get 5 instead of 3.

mkeintz · Posted 07-17-2023 09:56 PM

@Tom wrote:

Still does not make sense to me.

How is it not going to count A and B twice in a series like:
date product
1 A
2 B
3 A
4 B
5 C
There are 3 distinct products there. But on each date there is one distinct product. If you add up five ones you get 5 instead of 3.

First, sorry for the delayed response. I've been offline.

This program won''t count A&B twice due to the following code (in particular, the "if prod_by_date ..." statement).

 if firstpass then do _d=dt-10 to dt;
    if prod_by_date.find()^=0 then do;      
      prod_by_date.add();
      if date_census.find()=0 then uniq_product=uniq_product+1;
      else uniq_product=1;
      date_census.replace();
    end;
  end;

For the first record, (Date=1, product=A ) the prod_by_date hash object (keyed on both product and date) will be updated by adding dataitems for product A and dates (represented by variable _D) in the range [-9,+1] At the same time, since these are the first dataitems for those dates, each of the corresponding dates in date_census object (keyed by _D) gets a new dataitem with 1 as initial unique product count.

For the second record, with a new product B, prod_by_date gets new dataitems for prod B and date range [-8,+2]. DATE=2 is completely new, so date_census gets uniq_product=1 for that date, but all the dates from -8 through +1 already exist (i.e. if date_census.find()=0 is true), so those dates will have uniq_product incremented by 1 in date_census. Each of those dates now have both A and B.

Now for the duplicate products:

For the third record, (DATE=3, PRODUCT=A) all the dates in [-7,+3] are checked for product A in the prod_by_date object. For dates in [-7,+1] there will already be a dataitem in place (i.e the find method would return a zero), meaning that the
if prod_by_date.find()^=0 then do;
test will fail, so neither prod_by_date nor date_census will be modified by code inside the do group. But for [+2,+3] there is not yet a product A (i.e. the if test is true), so both objects will be updated for those two dates. For date 2, uniq_product increments to 2 but for date 3, uniq_product is set to 1.

Fourth record (DATE=4,PRODUCT=B) will similarly leave prod_by_date and date_census unmodified for dates [-6,+2], but updated for dates [+3,+4].

And finally (DATE=5 PRODUCT=C) will add a new entry for date 5 with uniq_prod=1, and will increment the uniq_prod value for dates [-5,+4]

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 07-18-2023 10:01 AM

Not sure I follow that, but also it does not appear to have anyway to tell that a product might drop out of the list of products when the date window has moved forward. Or perhaps when the product should not have dropped out because of a later date that also had that same product.

Take a look at the HASH method I posted where the hash of found products is cleared before counting again for new source date. Perhaps your method of building a date*product hash on the first pass could work, but the hash that is counting future products from the current date would need to be cleared before looping over the future dates.

Tom · Posted 07-16-2023 05:12 PM

@Zatere wrote:

Hi thank you for your reply.

Sorry for the confusion; an interval (or period) of X days makes more sense.

This solution is not the best in large volume of data.

SQL performance will depend on your database implementation. What indexes exists etc. and perhaps which database engine you are using to actual store the data.

You could try just reading the data multiple times to count the distinct products in a the following records. Perhaps something like:

data want ;
  if _n_=1 then do ;
     declare hash h();
     rc=h.definekey('next_product');
     rc=h.definedata('next_product');
     rc=h.definedone();
  end;
  set have nobs=nobs;
  by id dt;
  rc=h.clear();
  do p=_n_ to nobs ;
    set have(keep=id dt product rename=(id=next_id dt=next_dt product=next_product)) point=p;
    if (id ne next_id) or (next_dt > (dt+10)) then leave;
    rc=h.add();
  end;
  uniq_products = h.num_items;
  drop rc next_: ;
run;

You could do some tests with different sized datasets to see which one actually runs faster.

Registration is open

SAS Training: Just a Click Away