Hi all,
The mostly Base SAS script below calculates the average time between purchases for a customer based on transaction dates in one variable. What I would like very much to do is rewrite this using just Proc SQL and to do this all in just one pass. Here is the full code plus I have broken the script out step by step below.
proc sort data=have ; by mstr_customer_id transaction_date order_id ;run; proc sort data=have out=purchase (keep=mstr_customer_id order_id transaction_date ) nodupkey; where transaction_sub_type=1 ; by mstr_customer_id order_id; ;run; data single multiple; set purchase; by mstr_customer_id; if first.mstr_customer_id=1 and last.mstr_customer_id=1 then output single; else output multiple;run; proc sort data=multiple; by mstr_customer_id transaction_date;run; data multiple1; format first_multiple_td_p last_multiple_td_p ; set multiple; format l_transaction_date date8.; retain first_multiple_td_p; by mstr_customer_id transaction_date; l_transaction_date = lag(transaction_date); if first.mstr_customer_id then do; l_transaction_date = .; date_diff=0; end; if not first.mstr_customer_id then do; date_diff = transaction_date-l_transaction_date; end; if first.mstr_customer_id then first_multiple_td_p = transaction_date; if last.mstr_customer_id then last_multiple_td_p = transaction_date; if last.mstr_customer_id then first_last_diff_p = last_multiple_td_p-first_multiple_td_p;run; proc sql; create table avg_bw_purchase as select distinct mstr_customer_id, transaction_date, date_diff, first_last_diff_p as first_last_diff_pur, avg(date_diff) as avg_time_bw_pur from multiple1 group by mstr_customer_id;quit; proc sort data=avg_bw_purchase; by mstr_customer_id transaction_date;run; data avg_bw_purchase(drop=date_diff transaction_date); set avg_bw_purchase; by mstr_customer_id transaction_date; if first_last_diff_pur>=0 then output;run;
Again, I need to rewrite the above his using just Proc SQL and to do this all in just one pass. Thanks!!
The data I start with looks like this ....
MSTR_CUSTOMER_ID | TRANSACTION_DATE | ORDER_ID | TRANSACTION_SUB_TYPE |
46862 | 12-Aug-17 | 00500009201708120000257749 | 1 |
46862 | 12-Aug-17 | 00500009201708120000257749 | 1 |
46862 | 25-Nov-17 | 00400027201711250000402816 | 1 |
46862 | 25-Nov-17 | 00400027201711250000402816 | 1 |
46862 | 25-Nov-17 | 00400027201711250000402816 | 1 |
46862 | 25-Nov-17 | 00400027201711250000402816 | 1 |
46862 | 25-Nov-17 | 00400027201711250000402816 | 1 |
54691 | 10-Jul-17 | 00100140201707100000271298 | 1 |
54691 | 10-Jul-17 | 00100140201707100000271298 | 1 |
54691 | 10-Jul-17 | 00100140201707100000271298 | 1 |
54691 | 10-Jul-17 | 00100140201707100000271298 | 1 |
54691 | 10-Jul-17 | 00100140201707100000271299 | 1 |
54691 | 10-Jul-17 | 00100140201707100000271299 | 1 |
54691 | 10-Jul-17 | 00100140201707100000271299 | 1 |
54691 | 22-Jul-17 | 00400140201707220000312323 | 1 |
54691 | 22-Jul-17 | 00400140201707220000312323 | 1 |
54691 | 22-Jul-17 | 00700140201707220000397098 | 1 |
54691 | 26-Nov-17 | 00300140201711260000225399 | 1 |
54691 | 26-Nov-17 | 00300140201711260000225399 | 1 |
54691 | 26-Nov-17 | 00300140201711260000225399 | 1 |
54691 | 26-Nov-17 | 00300140201711260000225399 | 1 |
54691 | 26-Nov-17 | 00300140201711260000225399 | 1 |
89553 | 7-May-17 | 00200013201705070000243369 | 1 |
92118 | 26-May-17 | 00300273201705260000126004 | 1 |
143541 | 28-May-17 | 00200091201705280000399615 | 1 |
143541 | 28-May-17 | 00200091201705280000399615 | 1 |
143541 | 12-Jun-17 | 00100091201706120000361014 | 1 |
143541 | 14-Jun-17 | 00300091201706140000338020 | 1 |
144239 | 10-Sep-17 | 00400202201709100000007705 | 1 |
144239 | 10-Sep-17 | 00400202201709100000007705 | 1 |
162339 | 19-Sep-17 | 00600092201709190000442273 | 1 |
Step 1
proc sort data=have; by mstr_customer_id transaction_date order_id ;run; proc sort data=have out=purchase (keep=mstr_customer_id order_id transaction_date ) nodupkey; where transaction_sub_type=1 ; by mstr_customer_id order_id; ;run;
TRANSACTION_DATE | MSTR_CUSTOMER_ID | ORDER_ID |
12-Aug-17 | 46862 | 00500009201708120000257749 |
25-Nov-17 | 46862 | 00400027201711250000402816 |
10-Jul-17 | 54691 | 00100140201707100000271298 |
10-Jul-17 | 54691 | 00100140201707100000271299 |
22-Jul-17 | 54691 | 00400140201707220000312323 |
22-Jul-17 | 54691 | 00700140201707220000397098 |
26-Nov-17 | 54691 | 00300140201711260000225399 |
10-Dec-17 | 54691 | 00900140201712100000334607 |
28-May-17 | 143541 | 00200091201705280000399615 |
12-Jun-17 | 143541 | 00100091201706120000361014 |
14-Jun-17 | 143541 | 00300091201706140000338020 |
Step #2 : separate multiple purchasers from single purchasers
data single multiple; set purchase; by mstr_customer_id; if first.mstr_customer_id=1 and last.mstr_customer_id=1 then output single; else output multiple;run; proc sort data=multiple; by mstr_customer_id transaction_date;run;
TRANSACTION_DATE | MSTR_CUSTOMER_ID | ORDER_ID |
12-Aug-17 | 46862 | 00500009201708120000257749 |
25-Nov-17 | 46862 | 00400027201711250000402816 |
10-Jul-17 | 54691 | 00100140201707100000271298 |
10-Jul-17 | 54691 | 00100140201707100000271299 |
22-Jul-17 | 54691 | 00400140201707220000312323 |
22-Jul-17 | 54691 | 00700140201707220000397098 |
26-Nov-17 | 54691 | 00300140201711260000225399 |
10-Dec-17 | 54691 | 00900140201712100000334607 |
28-May-17 | 143541 | 00200091201705280000399615 |
12-Jun-17 | 143541 | 00100091201706120000361014 |
14-Jun-17 | 143541 | 00300091201706140000338020 |
Step #3 :Calculate date_diff & first_last_diff_p
data multiple1; format first_multiple_td_p last_multiple_td_p ; set multiple; format l_transaction_date date8.; retain first_multiple_td_p; by mstr_customer_id transaction_date; l_transaction_date = lag(transaction_date); if first.mstr_customer_id then do; l_transaction_date = .; date_diff=0; end; if not first.mstr_customer_id then do; date_diff = transaction_date-l_transaction_date; end; if first.mstr_customer_id then first_multiple_td_p = transaction_date; if last.mstr_customer_id then last_multiple_td_p = transaction_date; if last.mstr_customer_id then first_last_diff_p = last_multiple_td_p-first_multiple_td_p;run;
first_multiple_td_p | last_multiple_td_p | TRANSACTION_DATE | MSTR_CUSTOMER_ID | ORDER_ID | l_transaction_date | date_diff | first_last_diff_p |
11-Aug-57 | 12-Aug-17 | 46862 | 00500009201708120000257749 | 0 | |||
11-Aug-57 | 24-Nov-57 | 25-Nov-17 | 46862 | 00400027201711250000402816 | 12-Aug-17 | 105 | 105 |
9-Jul-57 | 10-Jul-17 | 54691 | 00100140201707100000271298 | 0 | |||
9-Jul-57 | 10-Jul-17 | 54691 | 00100140201707100000271299 | 10-Jul-17 | 0 | ||
9-Jul-57 | 22-Jul-17 | 54691 | 00400140201707220000312323 | 10-Jul-17 | 12 | ||
9-Jul-57 | 22-Jul-17 | 54691 | 00700140201707220000397098 | 22-Jul-17 | 0 | ||
9-Jul-57 | 26-Nov-17 | 54691 | 00300140201711260000225399 | 22-Jul-17 | 127 | ||
9-Jul-57 | 9-Dec-57 | 10-Dec-17 | 54691 | 00900140201712100000334607 | 26-Nov-17 | 14 | 153 |
27-May-57 | 28-May-17 | 143541 | 00200091201705280000399615 | 0 | |||
27-May-57 | 12-Jun-17 | 143541 | 00100091201706120000361014 | 28-May-17 | 15 | ||
27-May-57 | 13-Jun-57 | 14-Jun-17 | 143541 | 00300091201706140000338020 | 12-Jun-17 | 2 | 17 |
Step #4 :Aggregate
proc sql; create table avg_bw_purchase as select distinct mstr_customer_id, transaction_date, date_diff, first_last_diff_p as first_last_diff_pur, avg(date_diff) as avg_time_bw_pur from multiple1 group by mstr_customer_id;quit;
MSTR_CUSTOMER_ID | TRANSACTION_DATE | date_diff | first_last_diff_pur | avg_time_bw_pur |
46862 | 12Aug2017 | 0 | 52.5 | |
46862 | 25Nov2017 | 105 | 105 | 52.5 |
54691 | 10Jul2017 | 0 | 25.5 | |
54691 | 22Jul2017 | 0 | 25.5 | |
54691 | 22Jul2017 | 12 | 25.5 | |
54691 | 26Nov2017 | 127 | 25.5 | |
54691 | 10Dec2017 | 14 | 153 | 25.5 |
143541 | 28May2017 | 0 | 5.666666667 | |
143541 | 12Jun2017 | 15 | 5.666666667 | |
143541 | 14Jun2017 | 2 | 17 | 5.666666667 |
Step #5 :Final - one row for each customer
proc sort data=avg_bw_purchase; by mstr_customer_id transaction_date;run; data avg_bw_purchase(drop=date_diff transaction_date); set avg_bw_purchase; by mstr_customer_id transaction_date; if first_last_diff_pur>=0 then output;run;
MSTR_CUSTOMER_ID | first_last_diff_pur | avg_time_bw_pur |
46862 | 105 | 52.5 |
54691 | 153 | 25.5 |
143541 | 17 | 5.666667 |
Again, any help rewriting this just using Proc SQL and in one pass will be greatly appreciated. Thanks!
I see. You're not interested in items per order, but the number of days between orders (which may contain 1 or 100 items):
proc sql noprint;
create table want as
select distinct a.MSTR_CUSTOMER_ID,
max(TRANSACTION_DATE) - min(TRANSACTION_DATE) as days,
b.orders_per_ID,
(max(TRANSACTION_DATE) - min(TRANSACTION_DATE))/b.orders_per_ID as average
from have as a, (select MSTR_CUSTOMER_ID,
count (distinct order_id) as orders_per_id
from have
group by MSTR_CUSTOMER_ID) as b
where a.MSTR_CUSTOMER_ID = b.MSTR_CUSTOMER_ID
group by a.MSTR_CUSTOMER_ID
having orders_per_id > 1
;
quit;
Also, it looks like some of your counts are off. I see that ID 143541 has 3 distinct order ID's, while you have 4. Also you have 153 days for the difference for 54691, while there are only 139 days between 26NOV2017 and 10JUL2017.
Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF
View now: on-demand content for SAS users
This what you need? I'm dividing between the number of orders, but you may want to do number of orders -1.
If you have 2 orders 10 days apart do you expect the difference to be 10 days or 10 days/2 orders= 5 days? You can change "count(*)" to "max((count(*) -1, 1)" if desired (not tested). Use the max function so that if there is only 1 order, you don't divide by 0.
data have;
length MSTR_CUSTOMER_ID 8. TRANSACTION_DATE 8. ORDER_ID $50. TRANSACTION_SUB_TYPE 8.;
input MSTR_CUSTOMER_ID TRANSACTION_DATE date11. ORDER_ID TRANSACTION_SUB_TYPE;
format transaction_date date11.;
datalines;
46862 12-Aug-17 00500009201708120000257749 1
46862 12-Aug-17 00500009201708120000257749 1
46862 25-Nov-17 00400027201711250000402816 1
46862 25-Nov-17 00400027201711250000402816 1
46862 25-Nov-17 00400027201711250000402816 1
46862 25-Nov-17 00400027201711250000402816 1
46862 25-Nov-17 00400027201711250000402816 1
54691 10-Jul-17 00100140201707100000271298 1
54691 10-Jul-17 00100140201707100000271298 1
54691 10-Jul-17 00100140201707100000271298 1
54691 10-Jul-17 00100140201707100000271298 1
54691 10-Jul-17 00100140201707100000271299 1
54691 10-Jul-17 00100140201707100000271299 1
54691 10-Jul-17 00100140201707100000271299 1
54691 22-Jul-17 00400140201707220000312323 1
54691 22-Jul-17 00400140201707220000312323 1
54691 22-Jul-17 00700140201707220000397098 1
54691 26-Nov-17 00300140201711260000225399 1
54691 26-Nov-17 00300140201711260000225399 1
54691 26-Nov-17 00300140201711260000225399 1
54691 26-Nov-17 00300140201711260000225399 1
54691 26-Nov-17 00300140201711260000225399 1
89553 7-May-17 00200013201705070000243369 1
92118 26-May-17 00300273201705260000126004 1
143541 28-May-17 0020009120170528000039961 51
143541 28-May-17 0020009120170528000039961 51
143541 12-Jun-17 0010009120170612000036101 41
143541 14-Jun-17 0030009120170614000033802 01
144239 10-Sep-17 0040020220170910000000770 51
144239 10-Sep-17 0040020220170910000000770 51
162339 19-Sep-17 0060009220170919000044227 31
;;;;
run;
proc sql noprint;
create table want as
select MSTR_CUSTOMER_ID, max(TRANSACTION_DATE) - min(TRANSACTION_DATE) as days, count (*) as orders,
(max(TRANSACTION_DATE) - min(TRANSACTION_DATE))/count(*) as average
from have
group by MSTR_CUSTOMER_ID;
quit;
Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF
View now: on-demand content for SAS users
Thanks noling. It looks like that's taking the average time between order id and not transaction date
MSTR_CUSTOMER_ID | days | orders | average |
46862 | 105 | 7 | 15 |
54691 | 153 | 24 | 6.375 |
89553 | 0 | 1 | 0 |
92118 | 0 | 1 | 0 |
143541 | 17 | 4 | 4.25 |
144239 | 0 | 2 | 0 |
162339 | 0 | 1 | 0 |
All customer with just one purchase are excluded.
So if I customer has more than one purchase I'm looking for average time between each distinct transaction date.
MSTR_CUSTOMER_ID | first_last_diff_pur | avg_time_bw_pur |
46862 | 105 | 52.5 |
54691 | 153 | 25.5 |
143541 | 17 | 5.666667 |
Thanks noling.
It looks like that's taking the average time between order id and not transaction date
MSTR_CUSTOMER_ID | days | orders | average |
46862 | 105 | 7 | 15 |
54691 | 153 | 24 | 6.375 |
89553 | 0 | 1 | 0 |
92118 | 0 | 1 | 0 |
143541 | 17 | 4 | 4.25 |
144239 | 0 | 2 | 0 |
162339 | 0 | 1 | 0 |
All customer with just one purchase are excluded.
So if I customer has more than one purchase I'm looking for average time between each distinct transaction date.
MSTR_CUSTOMER_ID | first_last_diff_pur | avg_time_bw_pur |
46862 | 105 | 52.5 |
54691 | 153 | 25.5 |
143541 | 17 | 5.666667 |
I see. You're not interested in items per order, but the number of days between orders (which may contain 1 or 100 items):
proc sql noprint;
create table want as
select distinct a.MSTR_CUSTOMER_ID,
max(TRANSACTION_DATE) - min(TRANSACTION_DATE) as days,
b.orders_per_ID,
(max(TRANSACTION_DATE) - min(TRANSACTION_DATE))/b.orders_per_ID as average
from have as a, (select MSTR_CUSTOMER_ID,
count (distinct order_id) as orders_per_id
from have
group by MSTR_CUSTOMER_ID) as b
where a.MSTR_CUSTOMER_ID = b.MSTR_CUSTOMER_ID
group by a.MSTR_CUSTOMER_ID
having orders_per_id > 1
;
quit;
Also, it looks like some of your counts are off. I see that ID 143541 has 3 distinct order ID's, while you have 4. Also you have 153 days for the difference for 54691, while there are only 139 days between 26NOV2017 and 10JUL2017.
Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF
View now: on-demand content for SAS users
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.