BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Zatere
Quartz | Level 8

Hello,

 

I would ask for your assistance on the following.

 

I have data in the following format:

 

 

data have;
input ID DT :ddmmyy10. PRODUCT$ AMOUNT;
format DT ddmmyy10.;
datalines;
1 04/02/2023 A 100
1 15/02/2023 B 55
1 01/03/2023 B 65
1 07/03/2023 A 89
1 10/03/2023 B 75
1 24/03/2023 A 300
2 15/02/2023 C 400
2 25/02/2023 D 56
;
run;

 

 

I would like to be able to know for each ID how many unique products within a consecutive 10 days period.

 

The output should be like the below:

 

 

data want;
input ID DT :ddmmyy10. PRODUCT$ AMOUNT uniq_products;
format DT ddmmyy10.;
datalines;
1 05/02/2023 A 100 2
1 15/02/2023 B 55 1
1 01/03/2023 B 65 2
1 07/03/2023 A 89 2
1 10/03/2023 B 75 1
1 24/03/2023 A 300 1
2 15/02/2023 C 400 2
2 25/02/2023 D 56 1
;
run;

 

ID	DT	PRODUCT	AMOUNT	uniq_products
1	05/02/2023	A	100	2
1	15/02/2023	B	55	1
1	01/03/2023	B	65	2
1	07/03/2023	A	89	2
1	10/03/2023	B	75	1
1	24/03/2023	A	300	1
2	15/02/2023	C	400	2
2	25/02/2023	D	56	1

 

 

Any help please?

 

Thanks in advance.

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

@Zatere wrote:

Hi thank you for your reply. 

Sorry for the confusion; an interval (or period) of X days makes more sense.

This solution is not the best in large volume of data.


If the SQL solution "is not the best in large volume of data" you can run a DATA step that reduces memory needs (and possibly time) dramatically by relying on the fact that the data are sorted by ID.  It uses hash objects that are cleared for re-use each time a new ID is encountered:

 

data have;
  input ID DT :ddmmyy10. PRODUCT$ AMOUNT;
  format DT ddmmyy10.;
  format DT date9. ;
datalines;
1 05/02/2023 A 100
1 15/02/2023 B 55
1 01/03/2023 B 65
1 07/03/2023 A 89
1 10/03/2023 B 75
1 24/03/2023 A 300
2 15/02/2023 C 400
2 25/02/2023 D 56
run;


data want (drop=_:);
  set have (in=firstpass)   have (in=secondpass) ;
  by id;

  if _n_=1 then do;
    _d=.;   /* Use this for date index */
    declare hash prod_by_date ();
      prod_by_date.definekey('_d','product');
      prod_by_date.definedone();
    declare hash date_census();
      date_census.definekey('_d');
      date_census.definedata('_d','uniq_product');
      date_census.definedone();
  end;

  if first.id then do;
    _rc=prod_by_date.clear();
    _rc=date_census.clear();
  end;

  if firstpass then do _d=dt-10 to dt;
    if prod_by_date.find()^=0 then do;      
      prod_by_date.add();
      if date_census.find()=0 then uniq_product=uniq_product+1;
      else uniq_product=1;
      date_census.replace();
    end;
  end;

  if secondpass;
  date_census.find(key:dt);
run;

 

Editted addition:

@Tom pointed out that this code is not entirely clear.  Here's an explanation that I think will help:

 

Each ID is processed twice, once to build up unique counts for each date, and the second time to retrieve thouse counts.

 

During the first pass of each ID, for each incoming DT, this program updates the prod_by_date object for every date in the 11-day window _D= DT-10 through DT.  Given the time gaps, most of those _D values don’t show up as observed dates in the incoming record – but it doesn’t matter - this is just building a calendar.

 

If a date _D in the window does not have the product in hand, then (1) a dataitem (think “a row”) keyed on product and _D is added to the  prod_by_date object, and (2) the unique count for _D is incremented in object DATE_CENSUS, keyed only on _D.   So every date in DATE_CENSUS could belong to as many as 11 windows.  It is this object that is retrieved during the second pass to get unique counts.

 

 During the second pass I could have:

 

  _d=dt;
  date_census.find();

But this does the same thing (by substituting DT for the default hash key variable _D).

 

 

  date_census.find(key:dt);

 

I've submitted a better explanation of this code avoids double counting if a product shows up more than once within a ten-day window.  It's in a separate note in this thread.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

13 REPLIES 13
mkeintz
PROC Star

So for each encountered date  (for a given ID), you want a count of unique products dated up to ten days prior through ten days later, yes?  I.e. a date span of 21 days.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Zatere
Quartz | Level 8

Hi thank you for your reply. Not exactly. For each given ID and DT (you may say for each row) I want to count the unique products for the next 10 days (10 days is just an example; it could also be for the next 15 days).

Taking  ID = 1 as an example: 

for the row 1 and for the next 10 days this ID has 2 unique products
for the row 2 and for the next 10 days this ID has 1 unique producy
and so on
PaigeMiller
Diamond | Level 26

Does the 10 day period start on the date in the data set (variable named DT)? Or some other start point?

--
Paige Miller
Zatere
Quartz | Level 8

Hi thank you for your reply. 

Yes, the 10 day period starts on the date in the data set (variable named DT) and counts unique products for the next 10 days.

 

Tom
Super User Tom
Super User

You clearly are not look for CONSECUTIVE days since there are gaps between all of the dates in your source dataset.  So perhaps you meant an interval (or period) of X days.  So what is the starting date of the interval? What is the ending date?

 

If I assume that the first date in the HAVE dataset should be the 5th of the month so that it matches the WANT dataset then using the 11 day period that starts of the listed DT and includes the next 10 days seems to generate the same result as you wanted.

 

In general the easiest way to code something like that it to to a self-join in SQL code.

proc sql;
create table want as 
  select distinct a.*,count(distinct b.product) as uniq_products 
  from have a 
  left join have b
  on a.id=b.id and b.dt between a.dt and a.dt+10
  group by a.id,a.dt
;
quit;

Result

Tom_0-1689536492246.png

 

PS Displaying dates in either DMY or MDY order will confuse half of your audience.  Use YMD order or DATE format instead.

 

Zatere
Quartz | Level 8

Hi thank you for your reply. 

Sorry for the confusion; an interval (or period) of X days makes more sense.

This solution is not the best in large volume of data.

mkeintz
PROC Star

@Zatere wrote:

Hi thank you for your reply. 

Sorry for the confusion; an interval (or period) of X days makes more sense.

This solution is not the best in large volume of data.


If the SQL solution "is not the best in large volume of data" you can run a DATA step that reduces memory needs (and possibly time) dramatically by relying on the fact that the data are sorted by ID.  It uses hash objects that are cleared for re-use each time a new ID is encountered:

 

data have;
  input ID DT :ddmmyy10. PRODUCT$ AMOUNT;
  format DT ddmmyy10.;
  format DT date9. ;
datalines;
1 05/02/2023 A 100
1 15/02/2023 B 55
1 01/03/2023 B 65
1 07/03/2023 A 89
1 10/03/2023 B 75
1 24/03/2023 A 300
2 15/02/2023 C 400
2 25/02/2023 D 56
run;


data want (drop=_:);
  set have (in=firstpass)   have (in=secondpass) ;
  by id;

  if _n_=1 then do;
    _d=.;   /* Use this for date index */
    declare hash prod_by_date ();
      prod_by_date.definekey('_d','product');
      prod_by_date.definedone();
    declare hash date_census();
      date_census.definekey('_d');
      date_census.definedata('_d','uniq_product');
      date_census.definedone();
  end;

  if first.id then do;
    _rc=prod_by_date.clear();
    _rc=date_census.clear();
  end;

  if firstpass then do _d=dt-10 to dt;
    if prod_by_date.find()^=0 then do;      
      prod_by_date.add();
      if date_census.find()=0 then uniq_product=uniq_product+1;
      else uniq_product=1;
      date_census.replace();
    end;
  end;

  if secondpass;
  date_census.find(key:dt);
run;

 

Editted addition:

@Tom pointed out that this code is not entirely clear.  Here's an explanation that I think will help:

 

Each ID is processed twice, once to build up unique counts for each date, and the second time to retrieve thouse counts.

 

During the first pass of each ID, for each incoming DT, this program updates the prod_by_date object for every date in the 11-day window _D= DT-10 through DT.  Given the time gaps, most of those _D values don’t show up as observed dates in the incoming record – but it doesn’t matter - this is just building a calendar.

 

If a date _D in the window does not have the product in hand, then (1) a dataitem (think “a row”) keyed on product and _D is added to the  prod_by_date object, and (2) the unique count for _D is incremented in object DATE_CENSUS, keyed only on _D.   So every date in DATE_CENSUS could belong to as many as 11 windows.  It is this object that is retrieved during the second pass to get unique counts.

 

 During the second pass I could have:

 

  _d=dt;
  date_census.find();

But this does the same thing (by substituting DT for the default hash key variable _D).

 

 

  date_census.find(key:dt);

 

I've submitted a better explanation of this code avoids double counting if a product shows up more than once within a ten-day window.  It's in a separate note in this thread.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

It is not clear how that finds unique products within the date range.

Tom
Super User Tom
Super User

Still does not make sense to me.

 

How is it not going to count A and B twice in a series like:

date product
1 A
2 B
3 A
4 B
5 C

There are 3 distinct products there.  But on each date there is one distinct product.  If you add up five ones you get 5 instead of 3.

mkeintz
PROC Star

@Tom wrote:

Still does not make sense to me.

 

How is it not going to count A and B twice in a series like:

date product
1 A
2 B
3 A
4 B
5 C

There are 3 distinct products there.  But on each date there is one distinct product.  If you add up five ones you get 5 instead of 3.


First, sorry for the delayed response.  I've been offline.

 

This program won''t count A&B twice due to the following code (in particular, the "if prod_by_date ..." statement).

 

 if firstpass then do _d=dt-10 to dt;
    if prod_by_date.find()^=0 then do;      
      prod_by_date.add();
      if date_census.find()=0 then uniq_product=uniq_product+1;
      else uniq_product=1;
      date_census.replace();
    end;
  end;

 

 

For the first record, (Date=1, product=A ) the prod_by_date hash object (keyed on both product and date) will be updated by adding dataitems for product A and dates (represented by variable _D) in the range [-9,+1]  At the same time, since these are the first dataitems for those dates, each of the  corresponding dates in date_census object (keyed by _D) gets a new dataitem with 1 as initial unique product count.

 

For the second record, with a new product B, prod_by_date gets new dataitems for prod B and date range [-8,+2].  DATE=2 is completely new, so date_census gets uniq_product=1 for that date, but all the dates from -8 through +1 already exist (i.e. if date_census.find()=0  is true), so those dates will have uniq_product incremented by 1 in date_census.  Each of those dates now have both A and B.

 

Now for the duplicate products:

 

For the third record, (DATE=3, PRODUCT=A) all the dates in [-7,+3] are checked for product A in the prod_by_date object.  For dates in [-7,+1] there will already be a dataitem in place (i.e the find method would return a zero), meaning that the 
             if prod_by_date.find()^=0 then do;
test will fail, so neither prod_by_date nor date_census will be modified by code inside the do group.  But for [+2,+3] there is not yet a product A (i.e. the if test is true), so both objects will be updated for those two dates.  For date 2, uniq_product increments to 2 but for date 3, uniq_product is set to 1.

 

Fourth record (DATE=4,PRODUCT=B) will similarly leave prod_by_date and date_census unmodified for dates [-6,+2], but updated for dates [+3,+4].

 

And finally (DATE=5 PRODUCT=C) will add a new entry for date 5 with uniq_prod=1, and will increment the uniq_prod value for dates [-5,+4]

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

Not sure I follow that, but also it does not appear to have anyway to tell that a product might drop out of the list of products when the date window has moved forward.  Or perhaps when the product  should not have dropped out because of a later date that also had that same product. 

 

Take a look at the HASH method I posted where the hash of found products is cleared before counting again for new source date.  Perhaps your method of building a date*product hash on the first pass could work, but the hash that is counting future products from the current date would need to be cleared before looping over the future dates.

Tom
Super User Tom
Super User

@Zatere wrote:

Hi thank you for your reply. 

Sorry for the confusion; an interval (or period) of X days makes more sense.

This solution is not the best in large volume of data.


SQL performance will depend on your database implementation. What indexes exists etc. and perhaps which database engine you are using to actual store the data.

 

You could try just reading the data multiple times to count the distinct products in a the following records.  Perhaps something like:

data want ;
  if _n_=1 then do ;
     declare hash h();
     rc=h.definekey('next_product');
     rc=h.definedata('next_product');
     rc=h.definedone();
  end;
  set have nobs=nobs;
  by id dt;
  rc=h.clear();
  do p=_n_ to nobs ;
    set have(keep=id dt product rename=(id=next_id dt=next_dt product=next_product)) point=p;
    if (id ne next_id) or (next_dt > (dt+10)) then leave;
    rc=h.add();
  end;
  uniq_products = h.num_items;
  drop rc next_: ;
run;

You could do some tests with different sized datasets to see which one actually runs faster.

 

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 2448 views
  • 2 likes
  • 4 in conversation