Data rollup - sum previous rows if criteria are me...

05-10-2018 07:20 PM

Hello,

I am trying to summarize customer behavior to determine, for each of their orders, whether they were a new customer on that order or an existing customer. To make things more difficult, I want to break the existing customers into more specific groups based on whether they had purchased within the year before that order. Furthermore, if they had purchased in the previous year, I want to break them apart by how much they spent in that year. I have a dataset of customers and all of their orders, and want to leave it at the order level (one line per order, not one line per customer).

For an example household I already have the first four columns, but want to calculate the last two columns (demand366 and demand732) so I can then classify existing customers.

| customer | first_order | order_date | demand  | demand366 | demand732 |
|----------|-------------|------------|---------|-----------|-----------|
| 123      | 03/15/15    | 03/15/15   | $36.25  | $0.00     | $0.00     |
| 123      | 03/15/15    | 03/18/15   | $50.00  | $36.25    | $36.25    |
| 123      | 03/15/15    | 11/11/15   | $126.00 | $86.25    | $86.25    |
| 123      | 03/15/15    | 02/06/16   | $72.00  | $212.25   | $212.25   |
| 123      | 03/15/15    | 10/20/16   | $186.95 | $198.00   | $284.25   |
| 123      | 03/15/15    | 10/15/17   | $45.00  | $186.95   | $384.95   |
| 123      | 03/15/15    | 01/27/18   | $92.00  | $45.00    | $303.95   |

So, for customer 123, on their first order (3/15/15), they would have $0 in demand for the 12 months (366 days) and 24 months (732 days) PRIOR to that first order. On their next order, 3/18/15, they would have spent $36.25 PRIOR to that order (both for 12 and 24 months). To get more complicated, if you look at the order on 10/20/16, only the prior two orders were within 12 months, so the demand366 total would be $198.00 ($126 + $72), but the 24 month total would include all four prior orders ($36.25 + $50 + $126 + $72 = $284.25).
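For reference, the example rows above can be written as a self-contained data step (a sketch; the dataset name HAVE is an assumption matching the code in the replies):

```sas
/* The example rows above as a self-contained data step.        */
/* Dataset name HAVE is assumed to match the replies' code.     */
data have;
   infile datalines dlm='|';
   informat first_order order_date mmddyy8.;
   format first_order order_date mmddyy8. demand dollar8.2;
   input customer first_order order_date demand;
datalines;
123|03/15/15|03/15/15|36.25
123|03/15/15|03/18/15|50.00
123|03/15/15|11/11/15|126.00
123|03/15/15|02/06/16|72.00
123|03/15/15|10/20/16|186.95
123|03/15/15|10/15/17|45.00
123|03/15/15|01/27/18|92.00
;
run;
```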

I'm struggling with how to get SAS to calculate this, and with whether do loops, first./last. processing, or something within proc sql would work best... or some combination of those methods. I know I am over-complicating this in my mind; I feel there should be a fairly easy way to do this. Any suggestions would be greatly appreciated!

Accepted Solutions

Solution

05-15-2018 03:56 PM

Posted in reply to dolldata

05-15-2018 02:38 AM - edited 05-15-2018 03:20 PM

Assuming all your data is in SAS tables, here is a data step option which should perform much better.

```sas
data want2(drop=_:);
   if _n_=1 then do;
      /* create the host variables for the hash object */
      if 0 then set have(keep=customer order_date demand
                         rename=(order_date=_order_date demand=_demand));
      dcl hash h1(dataset:'have(keep=customer order_date demand rename=(order_date=_order_date demand=_demand))',
                  multidata:'y');
      h1.defineKey('customer');
      h1.defineData('_order_date','_demand');
      h1.defineDone();
   end;
   set have;
   demand366_calc=0;
   demand732_calc=0;
   /* loop over every stored row for this customer and sum prior demand */
   do while(h1.do_over() eq 0);
      if order_date > _order_date then do;
         if _order_date >= order_date-366 then demand366_calc=sum(demand366_calc,_demand);
         if _order_date >= order_date-732 then demand732_calc=sum(demand732_calc,_demand);
      end;
   end;
run;
```

This option loads all the rows and required variables from your source table into a hash lookup table in memory. 7.4 million records should fit; if you do run into out-of-memory conditions, there are less resource-hungry alternatives and extensions, but they require additional code logic and are harder to understand and maintain.

Change 16/05/2018:

Code above fixed based on @dolldata's comment (source table name WANT2 replaced with HAVE).

All Replies

Posted in reply to dolldata

05-10-2018 08:07 PM - edited 05-10-2018 08:08 PM

In the future, please post your sample data in the form of a working SAS data step, so we can spend our time answering your question instead of reading your data into a SAS dataset.

The SQL below should do what you're asking for.

```sas
proc sql;
   create table want as
      select
         customer,
         first_order,
         order_date,
         demand,
         demand366,
         demand732,
         /* sum of this customer's demand in the 366 days before the order */
         coalesce((select sum(i.demand) from have as i
                   where i.customer=o.customer
                     and i.order_date between o.order_date-1 and o.order_date-366), 0) as demand366_calc,
         /* same, for the 732 days before the order */
         coalesce((select sum(i.demand) from have as i
                   where i.customer=o.customer
                     and i.order_date between o.order_date-1 and o.order_date-732), 0) as demand732_calc
      from have as o
   ;
quit;
```
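As a quick sanity check (a sketch, assuming the table WANT created above and that your rows carry hand-computed demand366/demand732 columns as in the example), any disagreement between the stored and calculated totals can be listed with:

```sas
/* Sanity check: list rows where the hand-keyed totals disagree  */
/* with the calculated ones. Assumes WANT from the query above.  */
proc sql;
   select customer, order_date,
          demand366, demand366_calc,
          demand732, demand732_calc
   from want
   where demand366 ne demand366_calc
      or demand732 ne demand732_calc;
quit;
```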

Posted in reply to Patrick

05-10-2018 11:26 PM

Apologies for not putting the data in a step - that was a big miss. But thank you for this solution - it works perfectly! I had not heard of the coalesce function, but will definitely be researching it more.
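As a quick illustration (a minimal sketch, not from the thread): COALESCE returns its first non-missing argument, which is why the query above can turn a missing SUM (a customer with no prior orders) into 0:

```sas
/* COALESCE returns the first non-missing argument */
data _null_;
   a = .; b = .; c = 5;
   x = coalesce(a, b, c);   /* x=5: skips the missing values  */
   y = coalesce(a, 0);      /* y=0: missing replaced by zero  */
   put x= y=;
run;
```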

Posted in reply to Patrick

05-15-2018 01:04 AM

I hate to add another reply to the thread, especially after accepting the solution, but is there a way to speed up the calculations? I am running this on about 7.4 million records, summarizing the activity of 974K customers (we are running SAS Studio release 3.6, with the processing on a server).

I realize it should take a long time, as it is doing so many calculations on each customer, but at the current rate, it will take about three days to get the processing done. Which may just be the way it is. Just curious if there are ideas on how to optimize that?

Posted in reply to dolldata

05-15-2018 01:34 AM

The SQL code posted is certainly not the most performant option. Is your source table in SAS or in a database (and if in a database, which one)?

Posted in reply to Patrick

05-15-2018 10:06 AM

I am working with SAS datasets.

I kept getting an error that the first instance of the 'want2' table didn't exist:

```
78 data want2(drop=_:);
79
80 if _n_=1 then
81 do;
82 if 0 then set want2(keep=customer order_date demand rename=(order_date=_order_date demand=_demand));
ERROR: File WORK.WANT2.DATA does not exist.
83 dcl hash h1(dataset:'want2(keep=customer order_date demand rename=(order_date=_order_date demand=_demand))',
83 ! multidata:'y');
84 h1.defineKey('customer');
85 h1.defineData('_order_date','_demand');
86 h1.defineDone();
87 end;
88
89 set have;
90
91 demand366_calc=0;
92 demand732_calc=0;
93 do while(h1.do_over() eq 0);
94 if order_date > _order_date then
95 do;
96 if _order_date >= order_date-366 then demand366_calc=sum(demand366_calc,_demand);
97 if _order_date >= order_date-732 then demand732_calc=sum(demand732_calc,_demand);
98 end;
99 end;
100 run;
```

After trial and error, replacing 'want2' with 'have' within the datastep made it work:

```sas
data want2(drop=_:);
   if _n_=1 then do;
      /* create the host variables for the hash object */
      if 0 then set have(keep=customer order_date demand
                         rename=(order_date=_order_date demand=_demand));
      dcl hash h1(dataset:'have(keep=customer order_date demand rename=(order_date=_order_date demand=_demand))',
                  multidata:'y');
      h1.defineKey('customer');
      h1.defineData('_order_date','_demand');
      h1.defineDone();
   end;
   set have;
   demand366_calc=0;
   demand732_calc=0;
   /* loop over every stored row for this customer and sum prior demand */
   do while(h1.do_over() eq 0);
      if order_date > _order_date then do;
         if _order_date >= order_date-366 then demand366_calc=sum(demand366_calc,_demand);
         if _order_date >= order_date-732 then demand732_calc=sum(demand732_calc,_demand);
      end;
   end;
run;
```

I have spot-checked several records and everything appears to be calculating correctly. This code is incredible - processes all 7.4 million records in just 15 seconds. I definitely need to spend more time understanding hash processing!
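For anyone else digging into hash processing, here is a minimal lookup sketch (the dataset and variable names SEGMENTS, ORDERS, and SEGMENT are hypothetical, not from this thread):

```sas
/* Minimal hash lookup: attach a segment label to each order.   */
/* SEGMENTS (customer, segment) and ORDERS are hypothetical.    */
data labeled;
   if _n_=1 then do;
      if 0 then set segments;          /* define host variables */
      dcl hash h(dataset:'segments');  /* load lookup in memory */
      h.defineKey('customer');
      h.defineData('segment');
      h.defineDone();
   end;
   set orders;
   /* find() returns 0 on a key match and fills SEGMENT */
   if h.find() ne 0 then segment='unknown';
run;
```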

Thank you again for the help!

Posted in reply to dolldata

05-15-2018 03:17 PM - edited 05-15-2018 03:34 PM

You've passed the test (just joking!). But it's great that you identified and fixed the issue yourself and that things are working in your environment.

I suggest that, if you still can, you mark the hash code as the solution instead of the SQL code.