BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
cara_catus
Fluorite | Level 6

New SAS user! First post!

I’m trying to create an output which calculatesO. The cohorts are defined by the year which the customer shopped for the first time. So, cohort 2003 will only consist of those that shopped for the first time in 2003 and not include those that shopped for the first time in 2000, 2001, 2002.  I'm having trouble figuring out a more efficient way to do this and calculate the final metric across the years.


The data set I start out with, looks like this.

CUSTOMER_IDtransaction_datetotalpurchase_orders
982-May-03521
32431-Mar-01316.51
38614-Sep-0352.51
40913-Jan-0144.991
40912-Jun-0152.51
4308-Dec-03651
46919-Mar-0337.51
4694-Nov-03102.961
46917-Nov-0329.991
52714-Dec-0314.991
76431-Jul-03124.971
95913-Aug-0292.491
111917-Dec-034501
115214-May-02852
115218-Oct-02162.991
115224-Apr-0398.491
115212-Jun-0352.51
118911-Jul-0141.981

Average order volume is calculated by:

total / purchase_orders

To produce the cohort groups, I'm creating separate transaction data sets for each year and then and then removing the customer_ids which shopped the previous years.

proc sql;

create table ltv.ID_2000 as

select distinct

year(transaction_date) as year, 

CUSTOMER_ID,

count( distinct CUSTOMER_ID ) as count

from ltv._2000; quit;

proc sql;

create table ltv.ID_2001 as

select distinct

year(transaction_date) as year, 

CUSTOMER_ID,

count( distinct CUSTOMER_ID ) as count

from ltv._2001

where  CUSTOMER_ID not in ( select distinct CUSTOMER_ID from ltv._2000);quit;

proc sql;

create table ltv.ID_2002 as

select distinct

year(transaction_date) as year, 

CUSTOMER_ID,

count( distinct CUSTOMER_ID ) as count

from ltv._2002

where  CUSTOMER_ID not in ( select distinct CUSTOMER_ID from ltv._2000)

       and CUSTOMER_ID  not in ( select distinct CUSTOMER_ID from ltv._2001);quit;

proc sql;

create table ltv.ID_2003 as

select distinct

year(transaction_date) as year,

CUSTOMER_ID,

count( distinct CUSTOMER_ID ) as count

from ltv._2003

where  CUSTOMER_ID not in ( select distinct CUSTOMER_ID from ltv._2000)

       and CUSTOMER_ID  not in ( select distinct CUSTOMER_ID from ltv._2001)

          and CUSTOMER_ID  not in ( select distinct CUSTOMER_ID from ltv._2002);quit;


Then I have to join all the data sets together .

proc sql;

create table ltv.join_master as

select distinct

  1. a.*,
  2. a.CUSTOMER_ID=c.CUSTOMER_ID as _2002,
  3. a.CUSTOMER_ID=d.CUSTOMER_ID as _2003,
  4. a.CUSTOMER_ID=e.CUSTOMER_ID as _2004,
  5. a.CUSTOMER_ID=f.CUSTOMER_ID as _2005

from  ltv.ID_2001 a

left join ltv._2002 c

on a.CUSTOMER_ID=c.CUSTOMER_ID

left join ltv._2003 d

on a.CUSTOMER_ID=d.CUSTOMER_ID

left join ltv._2004 e

on a.CUSTOMER_ID=e.CUSTOMER_ID

left join ltv._2005 f

on a.CUSTOMER_ID=f.CUSTOMER_ID;

quit;


Any this is were I'm stuck. 


The final output I'm looking for should look something like this.  A variable for customer cohort (unique first time customers) count and then and then average order volume variables for each consecutive year for that cohort. 


Year Of First Purchase Customer Cohort  CountAverage Order Volume bought in Year + 1Average Order Volume bought in Year + 2Average Order Volume bought in Year + 3Average Order Volume bought in Year + 4Average Order Volume bought in Year + 5Average Order Volume bought in Year + 6Average Order Volume bought in Year + 7Average Order Volume bought in Year + 8Average Order Volume bought in Year + 9Average Order Volume bought in Year + 10Average Order Volume bought in Year + 11Average Order Volume bought in Year + 12Average Order Volume bought in Year + 13Average Order Volume bought in Year + 14Average Order Volume bought in Year + 15
200011,865120.55120.55120.55120.55222.00555.00555.00222.00222.00222.000.08233.66233.66454.99333.33
200164,209225.56225.56225.56225.56225.56684.00684.00225.56225.56225.560.07454.99454.990.04N/A
200285,932145.33145.33145.33145.33145.33656.44656.44145.33145.33145.330.06120.55120.55N/AN/A
2003142,229185.99185.99185.99185.99185.99555.65555.65185.99185.99185.990.05225.56N/AN/AN/A
2004194,213233.66233.66233.66233.66233.66222.00222.00233.66233.66233.660.03N/AN/AN/AN/A
2005219,872454.99454.99454.99454.99225.56225.56225.56225.56225.56225.56N/AN/AN/AN/AN/A
2006248,812120.55120.55120.55120.55145.33145.33145.33145.33145.33N/AN/AN/AN/AN/AN/A
2007264,167225.56225.56225.56225.56185.99222.00222.00185.99N/AN/AN/AN/AN/AN/AN/A
2008330,208145.33145.33145.33145.33233.66225.56225.56233.66N/AN/AN/AN/AN/AN/AN/A
2009364,707185.99185.99185.99185.99185.99145.33N/AN/AN/AN/AN/AN/AN/AN/AN/A
2010470,815233.66233.66233.66233.66233.66N/AN/AN/AN/AN/AN/AN/AN/AN/AN/A
2011553,243454.99454.99454.99454.99N/AN/AN/AN/AN/AN/AN/AN/AN/AN/AN/A
2012709,81633.65555.00145.66N/AN/AN/AN/AN/AN/AN/AN/AN/AN/AN/AN/A
2013739,036888.96655.00N/AN/AN/AN/AN/AN/AN/AN/AN/AN/AN/AN/AN/A
2014861,513383.00N/AN/AN/AN/AN/AN/AN/AN/AN/AN/AN/AN/AN/AN/A



Any help will be great. Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

I think this will help define your cohorts a bit easier - As long as you can assume that the first year of data is your first year of business, otherwise it's hard to define your cohorts.

Steps are:

1. Summarize by customer/year level.

2. Sort by customer/year. First entry for customer/year defines the cohort.

3. Create a year variable that would identify the rows.

4. Summarize again at a cohort/year level.

Hope this helps.

data have;

format transaction_date date9.;

input CUSTOMER_ID transaction_date : anydtdte. total purchase_orders;

cards;

98 2-May-03 52 1

324 31-Mar-01 316.5 1

386 14-Sep-03 52.5 1

409 13-Jan-01 44.99 1

409 12-Jun-01 52.5 1

430 8-Dec-03 65 1

469 19-Mar-03 37.5 1

469 4-Nov-03 102.96 1

469 17-Nov-03 29.99 1

527 14-Dec-03 14.99 1

764 31-Jul-03 124.97 1

959 13-Aug-02 92.49 1

1119 17-Dec-03 450 1

1152 14-May-02 85 2

1152 18-Oct-02 162.99 1

1152 24-Apr-03 98.49 1

1152 12-Jun-03 52.5 1

1189 11-Jul-01 41.98 1

;

run;

proc means data=have nway noprint;

class customer_id transaction_date;

format transaction_date year4.;

output out=customer_level sum(total)=orders_total  sum(purchase_orders)=orders_n;

run;

proc sort data=customer_level;

by customer_id transaction_date;

run;

data customer_level;

set customer_level;

by customer_id;

retain cohort;

if first.customer_id then Year=1;

if first.customer_id then cohort=year(transaction_date);

else year+1;

run;

proc means data=customer_level nway noprint;

class cohort year;

output out=output_level sum(orders_total)=orders_total sum(orders_n)=orders_n;

run;

View solution in original post

1 REPLY 1
Reeza
Super User

I think this will help define your cohorts a bit easier - As long as you can assume that the first year of data is your first year of business, otherwise it's hard to define your cohorts.

Steps are:

1. Summarize by customer/year level.

2. Sort by customer/year. First entry for customer/year defines the cohort.

3. Create a year variable that would identify the rows.

4. Summarize again at a cohort/year level.

Hope this helps.

data have;

format transaction_date date9.;

input CUSTOMER_ID transaction_date : anydtdte. total purchase_orders;

cards;

98 2-May-03 52 1

324 31-Mar-01 316.5 1

386 14-Sep-03 52.5 1

409 13-Jan-01 44.99 1

409 12-Jun-01 52.5 1

430 8-Dec-03 65 1

469 19-Mar-03 37.5 1

469 4-Nov-03 102.96 1

469 17-Nov-03 29.99 1

527 14-Dec-03 14.99 1

764 31-Jul-03 124.97 1

959 13-Aug-02 92.49 1

1119 17-Dec-03 450 1

1152 14-May-02 85 2

1152 18-Oct-02 162.99 1

1152 24-Apr-03 98.49 1

1152 12-Jun-03 52.5 1

1189 11-Jul-01 41.98 1

;

run;

proc means data=have nway noprint;

class customer_id transaction_date;

format transaction_date year4.;

output out=customer_level sum(total)=orders_total  sum(purchase_orders)=orders_n;

run;

proc sort data=customer_level;

by customer_id transaction_date;

run;

data customer_level;

set customer_level;

by customer_id;

retain cohort;

if first.customer_id then Year=1;

if first.customer_id then cohort=year(transaction_date);

else year+1;

run;

proc means data=customer_level nway noprint;

class cohort year;

output out=output_level sum(orders_total)=orders_total sum(orders_n)=orders_n;

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 1 reply
  • 442 views
  • 0 likes
  • 2 in conversation