In the following file, I need to identify any "Paid" that happens after the first "Shipped", in terms of both order and time. For example, for OrderID#1, the third record counts as it occurs after at least one shipment takes place. OrderID#2 does not have any record that counts because "Paid" cannot happen on the same day as "Shipped". For OrderID#4, its third record counts as "Paid" is supposed to occur after at least one "Shipped" and it needs be on the days after "Shipped". The last column "Completed" is my target.
Any suggestions?
OrderID | Shipped | Paid | date | Completed |
1 | 0 | 1 | 1/1/2014 | 0 |
1 | 1 | 0 | 1/2/2014 | 0 |
1 | 1 | 1 | 1/5/2014 | 1 |
2 | 1 | 1 | 1/2/2014 | 0 |
2 | 1 | 0 | 1/4/2014 | 0 |
3 | 1 | 0 | 1/5/2014 | 0 |
3 | 1 | 1 | 1/6/2014 | 1 |
3 | 1 | 0 | 1/8/2014 | 0 |
4 | 1 | 0 | 2/2/2014 | 0 |
4 | 1 | 1 | 2/2/2014 | 0 |
4 | 0 | 1 | 2/5/2014 | 1 |
4 | 1 | 0 | 2/6/2014 | 0 |
Thanks a bunch!
Lizi
A Proc SQL alternative if your data is not too big: (raw data stoled from Ksharp):
data have;
infile cards expandtabs truncover;
input OrderID Shipped Paid date : mmddyy10.;
format date mmddyy10.;
cards;
1 0 1 1/1/2014
1 1 0 1/2/2014
1 1 1 1/5/2014
2 1 1 1/2/2014
2 1 0 1/4/2014
3 1 0 1/5/2014
3 1 1 1/6/2014
3 1 0 1/8/2014
4 1 0 2/2/2014
4 1 1 2/2/2014
4 0 1 2/5/2014
4 1 0 2/6/2014
;
run;
proc sql;
create table want as
select *, exists(select * from have where shipped =1 and orderid=a.orderid
and a.paid=1 having min(date) < a.date) as completed
from have a
order by a.orderid, a.date, a.paid
;
quit;
Why 4 shouldn't be
OrderID | Shipped | Paid | date | Completed |
1 | 0 | 1 | 1/1/2014 | 0 |
1 | 1 | 0 | 1/2/2014 | 0 |
1 | 1 | 1 | 1/5/2014 | 1 |
2 | 1 | 1 | 1/2/2014 | 0 |
2 | 1 | 0 | 1/4/2014 | 0 |
3 | 1 | 0 | 1/5/2014 | 0 |
3 | 1 | 1 | 1/6/2014 | 1 |
3 | 1 | 0 | 1/8/2014 | 0 |
4 | 1 | 0 | 2/2/2014 | 0 |
4 | 1 | 1 | 2/2/2014 | 0 |
4 | 0 | 1 | 2/5/2014 | 1 |
4 | 1 | 0 | 2/6/2014 | 0 |
data have; infile cards expandtabs truncover; input OrderID Shipped Paid date : mmddyy10.; format date mmddyy10.; cards; 1 0 1 1/1/2014 1 1 0 1/2/2014 1 1 1 1/5/2014 2 1 1 1/2/2014 2 1 0 1/4/2014 3 1 0 1/5/2014 3 1 1 1/6/2014 3 1 0 1/8/2014 4 1 0 2/2/2014 4 1 1 2/2/2014 4 0 1 2/5/2014 4 1 0 2/6/2014 ; run; data want; set have; by OrderID; retain found flag .; if first.OrderID then do;found=.; flag=.;end; if found and Paid and not flag then do;Completed=1; flag=1; end; else Completed=0; if Shipped then found=1; drop found flag; run;
Xia Keshan
Thank you so much, Keshan! But I realized that the time issue is not taken care of in your program. I have to consider both order and time.
A Proc SQL alternative if your data is not too big: (raw data stoled from Ksharp):
data have;
infile cards expandtabs truncover;
input OrderID Shipped Paid date : mmddyy10.;
format date mmddyy10.;
cards;
1 0 1 1/1/2014
1 1 0 1/2/2014
1 1 1 1/5/2014
2 1 1 1/2/2014
2 1 0 1/4/2014
3 1 0 1/5/2014
3 1 1 1/6/2014
3 1 0 1/8/2014
4 1 0 2/2/2014
4 1 1 2/2/2014
4 0 1 2/5/2014
4 1 0 2/6/2014
;
run;
proc sql;
create table want as
select *, exists(select * from have where shipped =1 and orderid=a.orderid
and a.paid=1 having min(date) < a.date) as completed
from have a
order by a.orderid, a.date, a.paid
;
quit;
Thanks a lot, Kuo.
Your way is pretty intuitive to me as I am still more familiar with small datahase management than the SAS retain function.
I just realize Shipped and Paid should not be the same day.
data have; infile cards expandtabs truncover; input OrderID Shipped Paid date : mmddyy10.; format date mmddyy10.; cards; 1 0 1 1/1/2014 1 1 0 1/2/2014 1 1 1 1/5/2014 2 1 1 1/2/2014 2 1 0 1/4/2014 3 1 0 1/5/2014 3 1 1 1/6/2014 3 1 0 1/8/2014 4 1 0 2/2/2014 4 1 1 2/2/2014 4 0 1 2/5/2014 4 1 0 2/6/2014 ; run; data want; set have; by OrderID; retain found flag _date .; if first.OrderID then do;found=.; flag=.;_date=.;end; if found and Paid and not flag and date ne _date then do;Completed=1; flag=1; end; else Completed=0; if Shipped then do;found=1;_date=date;end; drop found flag _date; run;
Xia Keshan
Keshan, it is more the matter of combination of timing and order than timing only or order only. Say if I have OrderID#5 that has the following shipping and payment timing, the program above would not be able to catch this order (Completed=0)
OrderID | Shipped | Paid | date | Completed |
5 | 1 | 0 | 6/3/2014 | 0 |
5 | 1 | 0 | 6/5/2014 | 0 |
5 | 1 | 1 | 6/5/2014 | 1 |
5 | 1 | 0 | 6/5/2014 | 0 |
However, as you can see, Order #5 does have Completed=1.
In sum, I am looking for the record per order that indicates "Paid" occurs after (both in terms of order and time) at least one "Shipped" takes place.
Thanks again!
It is more complicated than I image .This time could work.
data have;
infile cards expandtabs truncover;
input OrderID Shipped Paid date : mmddyy10.;
format date mmddyy10.;
cards;
1 0 1 1/1/2014
1 1 0 1/2/2014
1 1 1 1/5/2014
2 1 1 1/2/2014
2 1 0 1/4/2014
3 1 0 1/5/2014
3 1 1 1/6/2014
3 1 0 1/8/2014
4 1 0 2/2/2014
4 1 1 2/2/2014
4 0 1 2/5/2014
4 1 0 2/6/2014
5 1 0 6/3/2014
5 1 0 6/5/2014
5 1 1 6/5/2014
5 1 0 6/5/2014
;
run;
data want;
set have;
by OrderID;
retain found flag _date .;
if first.OrderID then do;found=.; flag=.;_date=.;end;
if found and Paid and not flag and date ne _date then do;Completed=1; flag=1; end;
else Completed=0;
if not found and Shipped then do;found=1;_date=date;end;
drop found flag _date;
run;
Xia Keshan
I think you got it right. Thank you for spending so much time, Keshan.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.