New SAS user! First post!
I’m trying to create an output which calculatesO. The cohorts are defined by the year which the customer shopped for the first time. So, cohort 2003 will only consist of those that shopped for the first time in 2003 and not include those that shopped for the first time in 2000, 2001, 2002. I'm having trouble figuring out a more efficient way to do this and calculate the final metric across the years.
The data set I start out with, looks like this.
CUSTOMER_ID | transaction_date | total | purchase_orders |
98 | 2-May-03 | 52 | 1 |
324 | 31-Mar-01 | 316.5 | 1 |
386 | 14-Sep-03 | 52.5 | 1 |
409 | 13-Jan-01 | 44.99 | 1 |
409 | 12-Jun-01 | 52.5 | 1 |
430 | 8-Dec-03 | 65 | 1 |
469 | 19-Mar-03 | 37.5 | 1 |
469 | 4-Nov-03 | 102.96 | 1 |
469 | 17-Nov-03 | 29.99 | 1 |
527 | 14-Dec-03 | 14.99 | 1 |
764 | 31-Jul-03 | 124.97 | 1 |
959 | 13-Aug-02 | 92.49 | 1 |
1119 | 17-Dec-03 | 450 | 1 |
1152 | 14-May-02 | 85 | 2 |
1152 | 18-Oct-02 | 162.99 | 1 |
1152 | 24-Apr-03 | 98.49 | 1 |
1152 | 12-Jun-03 | 52.5 | 1 |
1189 | 11-Jul-01 | 41.98 | 1 |
Average order volume is calculated by:
total / purchase_orders
To produce the cohort groups, I'm creating separate transaction data sets for each year and then and then removing the customer_ids which shopped the previous years.
proc sql;
create table ltv.ID_2000 as
select distinct
year(transaction_date) as year,
CUSTOMER_ID,
count( distinct CUSTOMER_ID ) as count
from ltv._2000; quit;
proc sql;
create table ltv.ID_2001 as
select distinct
year(transaction_date) as year,
CUSTOMER_ID,
count( distinct CUSTOMER_ID ) as count
from ltv._2001
where CUSTOMER_ID not in ( select distinct CUSTOMER_ID from ltv._2000);quit;
proc sql;
create table ltv.ID_2002 as
select distinct
year(transaction_date) as year,
CUSTOMER_ID,
count( distinct CUSTOMER_ID ) as count
from ltv._2002
where CUSTOMER_ID not in ( select distinct CUSTOMER_ID from ltv._2000)
and CUSTOMER_ID not in ( select distinct CUSTOMER_ID from ltv._2001);quit;
proc sql;
create table ltv.ID_2003 as
select distinct
year(transaction_date) as year,
CUSTOMER_ID,
count( distinct CUSTOMER_ID ) as count
from ltv._2003
where CUSTOMER_ID not in ( select distinct CUSTOMER_ID from ltv._2000)
and CUSTOMER_ID not in ( select distinct CUSTOMER_ID from ltv._2001)
and CUSTOMER_ID not in ( select distinct CUSTOMER_ID from ltv._2002);quit;
Then I have to join all the data sets together .
proc sql;
create table ltv.join_master as
select distinct
from ltv.ID_2001 a
left join ltv._2002 c
on a.CUSTOMER_ID=c.CUSTOMER_ID
left join ltv._2003 d
on a.CUSTOMER_ID=d.CUSTOMER_ID
left join ltv._2004 e
on a.CUSTOMER_ID=e.CUSTOMER_ID
left join ltv._2005 f
on a.CUSTOMER_ID=f.CUSTOMER_ID;
quit;
Any this is were I'm stuck.
The final output I'm looking for should look something like this. A variable for customer cohort (unique first time customers) count and then and then average order volume variables for each consecutive year for that cohort.
Year Of First Purchase | Customer Cohort Count | Average Order Volume bought in Year + 1 | Average Order Volume bought in Year + 2 | Average Order Volume bought in Year + 3 | Average Order Volume bought in Year + 4 | Average Order Volume bought in Year + 5 | Average Order Volume bought in Year + 6 | Average Order Volume bought in Year + 7 | Average Order Volume bought in Year + 8 | Average Order Volume bought in Year + 9 | Average Order Volume bought in Year + 10 | Average Order Volume bought in Year + 11 | Average Order Volume bought in Year + 12 | Average Order Volume bought in Year + 13 | Average Order Volume bought in Year + 14 | Average Order Volume bought in Year + 15 |
2000 | 11,865 | 120.55 | 120.55 | 120.55 | 120.55 | 222.00 | 555.00 | 555.00 | 222.00 | 222.00 | 222.00 | 0.08 | 233.66 | 233.66 | 454.99 | 333.33 |
2001 | 64,209 | 225.56 | 225.56 | 225.56 | 225.56 | 225.56 | 684.00 | 684.00 | 225.56 | 225.56 | 225.56 | 0.07 | 454.99 | 454.99 | 0.04 | N/A |
2002 | 85,932 | 145.33 | 145.33 | 145.33 | 145.33 | 145.33 | 656.44 | 656.44 | 145.33 | 145.33 | 145.33 | 0.06 | 120.55 | 120.55 | N/A | N/A |
2003 | 142,229 | 185.99 | 185.99 | 185.99 | 185.99 | 185.99 | 555.65 | 555.65 | 185.99 | 185.99 | 185.99 | 0.05 | 225.56 | N/A | N/A | N/A |
2004 | 194,213 | 233.66 | 233.66 | 233.66 | 233.66 | 233.66 | 222.00 | 222.00 | 233.66 | 233.66 | 233.66 | 0.03 | N/A | N/A | N/A | N/A |
2005 | 219,872 | 454.99 | 454.99 | 454.99 | 454.99 | 225.56 | 225.56 | 225.56 | 225.56 | 225.56 | 225.56 | N/A | N/A | N/A | N/A | N/A |
2006 | 248,812 | 120.55 | 120.55 | 120.55 | 120.55 | 145.33 | 145.33 | 145.33 | 145.33 | 145.33 | N/A | N/A | N/A | N/A | N/A | N/A |
2007 | 264,167 | 225.56 | 225.56 | 225.56 | 225.56 | 185.99 | 222.00 | 222.00 | 185.99 | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
2008 | 330,208 | 145.33 | 145.33 | 145.33 | 145.33 | 233.66 | 225.56 | 225.56 | 233.66 | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
2009 | 364,707 | 185.99 | 185.99 | 185.99 | 185.99 | 185.99 | 145.33 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
2010 | 470,815 | 233.66 | 233.66 | 233.66 | 233.66 | 233.66 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
2011 | 553,243 | 454.99 | 454.99 | 454.99 | 454.99 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
2012 | 709,816 | 33.65 | 555.00 | 145.66 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
2013 | 739,036 | 888.96 | 655.00 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
2014 | 861,513 | 383.00 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
Any help will be great. Thanks!
I think this will help define your cohorts a bit easier - As long as you can assume that the first year of data is your first year of business, otherwise it's hard to define your cohorts.
Steps are:
1. Summarize by customer/year level.
2. Sort by customer/year. First entry for customer/year defines the cohort.
3. Create a year variable that would identify the rows.
4. Summarize again at a cohort/year level.
Hope this helps.
data have;
format transaction_date date9.;
input CUSTOMER_ID transaction_date : anydtdte. total purchase_orders;
cards;
98 2-May-03 52 1
324 31-Mar-01 316.5 1
386 14-Sep-03 52.5 1
409 13-Jan-01 44.99 1
409 12-Jun-01 52.5 1
430 8-Dec-03 65 1
469 19-Mar-03 37.5 1
469 4-Nov-03 102.96 1
469 17-Nov-03 29.99 1
527 14-Dec-03 14.99 1
764 31-Jul-03 124.97 1
959 13-Aug-02 92.49 1
1119 17-Dec-03 450 1
1152 14-May-02 85 2
1152 18-Oct-02 162.99 1
1152 24-Apr-03 98.49 1
1152 12-Jun-03 52.5 1
1189 11-Jul-01 41.98 1
;
run;
proc means data=have nway noprint;
class customer_id transaction_date;
format transaction_date year4.;
output out=customer_level sum(total)=orders_total sum(purchase_orders)=orders_n;
run;
proc sort data=customer_level;
by customer_id transaction_date;
run;
data customer_level;
set customer_level;
by customer_id;
retain cohort;
if first.customer_id then Year=1;
if first.customer_id then cohort=year(transaction_date);
else year+1;
run;
proc means data=customer_level nway noprint;
class cohort year;
output out=output_level sum(orders_total)=orders_total sum(orders_n)=orders_n;
run;
I think this will help define your cohorts a bit easier - As long as you can assume that the first year of data is your first year of business, otherwise it's hard to define your cohorts.
Steps are:
1. Summarize by customer/year level.
2. Sort by customer/year. First entry for customer/year defines the cohort.
3. Create a year variable that would identify the rows.
4. Summarize again at a cohort/year level.
Hope this helps.
data have;
format transaction_date date9.;
input CUSTOMER_ID transaction_date : anydtdte. total purchase_orders;
cards;
98 2-May-03 52 1
324 31-Mar-01 316.5 1
386 14-Sep-03 52.5 1
409 13-Jan-01 44.99 1
409 12-Jun-01 52.5 1
430 8-Dec-03 65 1
469 19-Mar-03 37.5 1
469 4-Nov-03 102.96 1
469 17-Nov-03 29.99 1
527 14-Dec-03 14.99 1
764 31-Jul-03 124.97 1
959 13-Aug-02 92.49 1
1119 17-Dec-03 450 1
1152 14-May-02 85 2
1152 18-Oct-02 162.99 1
1152 24-Apr-03 98.49 1
1152 12-Jun-03 52.5 1
1189 11-Jul-01 41.98 1
;
run;
proc means data=have nway noprint;
class customer_id transaction_date;
format transaction_date year4.;
output out=customer_level sum(total)=orders_total sum(purchase_orders)=orders_n;
run;
proc sort data=customer_level;
by customer_id transaction_date;
run;
data customer_level;
set customer_level;
by customer_id;
retain cohort;
if first.customer_id then Year=1;
if first.customer_id then cohort=year(transaction_date);
else year+1;
run;
proc means data=customer_level nway noprint;
class cohort year;
output out=output_level sum(orders_total)=orders_total sum(orders_n)=orders_n;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.