BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
dolldata
Obsidian | Level 7

Hello,

 

I am trying to summarize customer behavior to determine on each one of their orders if they were a new customer on that order or an existing customer.  To make things more difficult, I want to break the existing customers into more specific groups, based on if they had purchased within a year before or not.  Furthermore, if they had purchased in the previous year, break them apart by how much they spent in that previous year.  I have a dataset of customers and all of their orders, and want to leave it at the order level (one line per order, not one line per customer).

 

For an example household I already have the first four columns, but want to calculate the last two columns (demand366 and demand732) to be able to then classify existing customers.

 

customerfirst_orderorder_datedemanddemand366demand732
12303/15/1503/15/15$36.25$0.00$0.00
12303/15/1503/18/15$50.00$36.25$36.25
12303/15/1511/11/15$126.00$86.25$86.25
12303/15/1502/06/16$72.00$212.25$212.25
12303/15/1510/20/16$186.95$198.00$284.25
12303/15/1510/15/17$45.00$186.95$384.95
12303/15/1501/27/18$92.00$45.00$303.95

 

So, for customer 123, on their first order (3/15/15), they would have $0 in demand for the 12 months (366 days) and 24 months (732 days) PRIOR to that first order.  On their next order, 3/18/15, they would have spent $36.25 PRIOR to that order (both for 12 and 24 months).  To get more complicated, if you look at the order on 10/20/16, only the prior two orders were within 12 months, so the demand366 total would be $198.00 ($126 + $72), but the 24 month total would include all four prior orders ($36.25 + $50 + $126 + $72 = $284.25).

 

I'm struggling on how to get SAS to calculate this, and if do loops or first./last. or something within proc sql would work best...or some combination of all of those methods.  I know I am over-complicating this in my mind...I feel there should be a fairly easy way to do this?  Any suggestions would be greatly appreciated!

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@dolldata

Assuming all your data is in SAS tables here a data step option which should perform much better.

data want2(drop=_:);
  if _n_=1 then
    do;
      if 0 then set have(keep=customer order_date demand rename=(order_date=_order_date demand=_demand));
      dcl hash h1(dataset:'have(keep=customer order_date demand rename=(order_date=_order_date demand=_demand))', multidata:'y');
      h1.defineKey('customer');
      h1.defineData('_order_date','_demand');
      h1.defineDone();
    end;

  set have;

  demand366_calc=0;
  demand732_calc=0;
  do while(h1.do_over() eq 0);
    if order_date > _order_date then
      do;
        if _order_date >= order_date-366 then demand366_calc=sum(demand366_calc,_demand);
        if _order_date >= order_date-732 then demand732_calc=sum(demand732_calc,_demand);
      end;
  end;
run;

This option loads all the rows and required variables from your source into a hash lookup table in memory. 7.4 million records should fit but in case you encounter out-of-memory conditions: There would be less resource hungry alternatives/extensions - but they would require additional code logic (=harder to understand and maintain).

 

Change 16/05/2018:

Code above fixed based on @dolldata's comment (replace source table name WANT2 with HAVE).

View solution in original post

7 REPLIES 7
Patrick
Opal | Level 21

@dolldata

Please post in the future your sample data in the form of a working SAS data step so we can spend time on answering your question instead of reading your data into a SAS dataset.
Below SQL should do what you're asking for.

proc sql;
  create table want as
    select 
      customer,   
      first_order,
      order_date, 
      demand,     
      demand366,  
      demand732,
      coalesce((select sum(i.demand) from have as I where i.customer=o.customer and i.order_date between o.order_date-1 and o.order_date-366),0) as demand366_calc,
      coalesce((select sum(i.demand) from have as I where i.customer=o.customer and i.order_date between o.order_date-1 and o.order_date-732),0) as demand732_calc
    from have as O
    ;
quit;

 

dolldata
Obsidian | Level 7
Apologies for not putting the data in a step - that was a big miss. But thank you for this solution - it works perfectly! I had not heard of the coalesce function, but will definitely be researching it more.
dolldata
Obsidian | Level 7

I hate to add another reply to the string, especially after accepting the solution, but is there a way to speed up the calculations?  I am running this on about 7.4 million records, summarizing the activity of 974K customers (we are running SAS Studio release 3.6, with the processing on a server).

 

I realize it should take a long time, as it is doing so many calculations on each customer, but at the current rate, it will take about three days to get the processing done.  Which may just be the way it is.  Just curious if there are ideas on how to optimize that?

 

 

Patrick
Opal | Level 21

@dolldata

The SQL code posted is certainly not the most performant option. Is your source table in SAS or in a database (and if database: Which one?).

 

Patrick
Opal | Level 21

@dolldata

Assuming all your data is in SAS tables here a data step option which should perform much better.

data want2(drop=_:);
  if _n_=1 then
    do;
      if 0 then set have(keep=customer order_date demand rename=(order_date=_order_date demand=_demand));
      dcl hash h1(dataset:'have(keep=customer order_date demand rename=(order_date=_order_date demand=_demand))', multidata:'y');
      h1.defineKey('customer');
      h1.defineData('_order_date','_demand');
      h1.defineDone();
    end;

  set have;

  demand366_calc=0;
  demand732_calc=0;
  do while(h1.do_over() eq 0);
    if order_date > _order_date then
      do;
        if _order_date >= order_date-366 then demand366_calc=sum(demand366_calc,_demand);
        if _order_date >= order_date-732 then demand732_calc=sum(demand732_calc,_demand);
      end;
  end;
run;

This option loads all the rows and required variables from your source into a hash lookup table in memory. 7.4 million records should fit but in case you encounter out-of-memory conditions: There would be less resource hungry alternatives/extensions - but they would require additional code logic (=harder to understand and maintain).

 

Change 16/05/2018:

Code above fixed based on @dolldata's comment (replace source table name WANT2 with HAVE).

dolldata
Obsidian | Level 7

I am working with SAS datasets.

 

I kept getting an error that the first instance of the 'want2' table didn't exist:

 

 

 

 78         data want2(drop=_:);
 79         
 80           if _n_=1 then
 81             do;
 82               if 0 then set want2(keep=customer order_date demand rename=(order_date=_order_date demand=_demand));
 ERROR: File WORK.WANT2.DATA does not exist.
 83               dcl hash h1(dataset:'want2(keep=customer order_date demand rename=(order_date=_order_date demand=_demand))',
 83       ! multidata:'y');
 84               h1.defineKey('customer');
 85               h1.defineData('_order_date','_demand');
 86               h1.defineDone();
 87             end;
 88         
 89           set have;
 90         
 91           demand366_calc=0;
 92           demand732_calc=0;
 93           do while(h1.do_over() eq 0);
 94             if order_date > _order_date then
 95               do;
 96                 if _order_date >= order_date-366 then demand366_calc=sum(demand366_calc,_demand);
 97                 if _order_date >= order_date-732 then demand732_calc=sum(demand732_calc,_demand);
 98               end;
 99           end;
 100        run;

 

After trial and error, replacing 'want2' with 'have' within the datastep made it work:

 

data want2(drop=_:);

  if _n_=1 then
    do;
      if 0 then set have(keep=customer order_date demand rename=(order_date=_order_date demand=_demand));
      dcl hash h1(dataset:'have(keep=customer order_date demand rename=(order_date=_order_date demand=_demand))', multidata:'y');
      h1.defineKey('customer');
      h1.defineData('_order_date','_demand');
      h1.defineDone();
    end;

  set have;

  demand366_calc=0;
  demand732_calc=0;
  do while(h1.do_over() eq 0);
    if order_date > _order_date then
      do;
        if _order_date >= order_date-366 then demand366_calc=sum(demand366_calc,_demand);
        if _order_date >= order_date-732 then demand732_calc=sum(demand732_calc,_demand);
      end;
  end;
run;

 

I have spot-checked several records and everything appears to be calculating correctly.  This code is incredible - processes all 7.4 million records in just 15 seconds.  I definitely need to spend more time understanding hash processing!

 

Thank you again for the help! 

Patrick
Opal | Level 21

@dolldata

You've passed the test (just joking!). But great that you've identified and fixed the issue yourself Smiley Happy and that things are working in your environment.

 

I suggest that if you still can mark the hash code as solution instead of the SQL code.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1458 views
  • 3 likes
  • 2 in conversation