Re: Proc Sql Question on Rolling Dates with Conditions

NIUNIU · Posted 06-06-2011 01:08 AM

I have been thinking using Proc Sql self-join on the question below for many days. Somehow, i just cannot get the exact results. I really appreciate anyone can help me to resolve the puzzle. The general logic is "There are more than 2 test use different ID within 2 days from the same account and the total amount by these tests is above 4000 dollars. If so, the flag is 1, else the flag is 0". I want to produce the last column flag.

data temp;
input obs account $ date : mmddyy10. time : time9. test id amount hit;
format date mmddyy10. time time9.;
datalines;

1 387000 6/10/2010 12:00:00 1 13.15 2000 1
2 387000 6/9/2010 12:00:00 1 13.15 3000 0
3 387000 6/8/2010 12:00:00 1 13.14 2000 0
4 387000 6/6/2010 11:00:00 1 13.15 2000 1
5 387000 6/5/2010 10:00:00 1 13.14 2000 0
6 387000 6/4/2010 9:00:00 1 13.13 1000 0
7 387000 6/4/2010 8:00:00 1 13.13 500 0
8 386000 5/9/2010 10:00:00 1 12.12 5000 0
9 386000 5/2/2010 9:00:00 1 12.14 2000 0
10 386000 5/1/2010 8:00:00 1 12.13 2000 0
11 385000 3/3/2010 10:00:00 1 11.13 5000 1
12 385000 3/1/2010 9:00:00 1 11.12 3000 0
13 385000 3/1/2010 8:00:00 1 11.11 3000 0
;
run;

For example:
In obs 1, account 387000, within 2 days, (06/10 /10 -06/08/10), there are three test adopted different ID (13.15 and 13.14), and the total money is more than 4000 (2000+3000+2000), so the flag is 1
In obs 2, account 387000, within 2 days, (06/9/10 -06/07/10), there are only two test adopted different ID, even though the money is above $4000. The flag is 0
in Obs 3, account 387000, within 2 days, (06/8/10 -06/06/10), there are only two test adopted different ID, and the money is not more than $4000. The flag is 0
in Obs 4, account 387000, within 2 days, (06/6/10 -06/04/10), there are four usage adopted different ID (13.15, 13.14, and13.13), and the money is more than $4000 (it is 5500). The flag is 1

The flag has to satisfy the following points:

1) In the same account
2) within 2 days
3) more than 2 test (each observation is one test)
4) the sum of the amount by these test is above $4000
5) used different ID.

Thank you very much for your help!

NIUNIU · Posted 06-06-2011 03:14 PM

If I use the following code to run, it will work. However, the run time is so long when the data set is huge with millions of rows. Any way to reduce the code running time?

Thank you very much!

PROC SQL;

SELECT
a.*, 1 AS Flag
FROM
TEMP a
WHERE
EXISTS
(
SELECT *
FROM TEMP b
WHERE
b.Account = a.Account AND
b.date between a.date-2 and b.date
HAVING
SUM(b.Amount) >= 4000
AND COUNT(DISTINCT obs) > 2
)

Ksharp · Posted 06-06-2011 11:58 PM

I used Cartesian Product of sql is more efficient than your sub-query,But it is also not suitable for large table.
[pre]

data temp;
input obs account $ date : mmddyy10. time : time9. test id amount ;
format date mmddyy10. time time9.;
datalines;

1 387000 6/10/2010 12:00:00 1 13.15 2000
2 387000 6/9/2010 12:00:00 1 13.15 3000
3 387000 6/8/2010 12:00:00 1 13.14 2000
4 387000 6/6/2010 11:00:00 1 13.15 2000
5 387000 6/5/2010 10:00:00 1 13.14 2000
6 387000 6/4/2010 9:00:00 1 13.13 1000
7 387000 6/4/2010 8:00:00 1 13.13 500
8 386000 5/9/2010 10:00:00 1 12.12 5000
9 386000 5/2/2010 9:00:00 1 12.14 2000
10 386000 5/1/2010 8:00:00 1 12.13 2000
11 385000 3/3/2010 10:00:00 1 11.13 5000
12 385000 3/1/2010 9:00:00 1 11.12 3000
;
run;
proc sql ;
create table want(drop=_amount count_id count_test) as
select distinct a.*,count(distinct b.id) as count_id,count( b.test) as count_test,sum(b.amount) as _amount,
case when calculated _amount gt 4000 and calculated count_id ge 2 and
calculated count_test gt 2 then 1 else 0 end as flag
from temp as a ,temp as b
where a.account = b.account and b.date between a.date-2 and a.date
group by a.obs
;
quit;
proc sort data=want;
by descending date;
run;

[/pre]

Ksharp

NIUNIU · Posted 06-07-2011 12:59 AM

Hi Ksharp,
Thank you very much! However do you know how to resolve the same date, but different time issue.I think if I add another condition with the date like this will work.

or (b.date=a.date and b.time le a.time)

use "le" because somehow the SAS forum has bugs when produce the less and equal sign.

It did not recognized it.Do you have any suggest how to resolve it.
Thank you. Message was edited by: NIUNIU

Ksharp · Posted 06-07-2011 03:45 AM

Hi.
I don't know sth about your "the same date, but different time issue ".
This issue stands for what ?
Do you mean like this " b.date between a.date-2 and a.date"?
I think your "or (b.date=a.date and b.time le a.time)" is good.

Maybe you need to post some desired output,it will clarify your question.

Ksharp

NIUNIU · Posted 06-07-2011 05:27 AM

Hi Ksharp,

For example, the same dataset, but consider obs 12 and 13.

if I use the following code to produce the flag for same logic but without considering different id. Since obs 12 and 13 has the same date, so the code produce the flag=1 also for obs 13, while the flag really should be 0. The reason is because the code did not provide the time difference.

Your help is high appreciated.

proc sql ;
create table want (drop=_amount count_test) as
select distinct a.*,
/* count(distinct b.id) as count_id,*/
count( b.test) as count_test,
sum(b.amount) as _amount,

case when calculated _amount gt 4000
/* and calculated count_id ge 2 and */
and calculated count_test gt 1
then 1 else 0 end as flag

from temp as a ,temp as b
where a.account = b.account and b.date between a.date-2 and a.date
/* or (b.date=a.date and b.time le a.time)*/
group by a.obs
;
quit;

data temp;
input obs account $ date : mmddyy10. time : time9. test id amount ;
format date mmddyy10. time time9.;
datalines;

1 387000 6/10/2010 12:00:00 1 13.15 2000 1
2 387000 6/9/2010 12:00:00 1 13.15 3000 1
3 387000 6/8/2010 12:00:00 1 13.14 2000 0
4 387000 6/6/2010 11:00:00 1 13.15 2000 1
5 387000 6/5/2010 10:00:00 1 13.14 2000 0
6 387000 6/4/2010 9:00:00 1 13.13 1000 0
7 387000 6/4/2010 8:00:00 1 13.13 500 0
8 386000 5/9/2010 10:00:00 1 12.12 5000 0
9 386000 5/2/2010 9:00:00 1 12.14 2000 0
10 386000 5/1/2010 8:00:00 1 12.13 2000 0
11 385000 3/3/2010 10:00:00 1 11.13 5000 1
12 385000 3/1/2010 9:00:00 1 11.12 3000 1
13 385000 3/1/2010 8:00:00 1 11.11 3000 0
;
run;

Ksharp · Posted 06-07-2011 05:40 AM

Hi.
That would not be a problem. Because I used
[pre]
from temp as a ,temp as b
where a.account = b.account and b.date between a.date-2 and a.date
/* or (b.date=a.date and b.time le a.time)*/
group by a.obs
[/pre]

That obs is unique , so The problem you mentioned would not be appeared.
You can test it by yourself.
If you have other problems ,I will be here tomorrow.
I have to leave now.

Ksharp

NIUNIU · Posted 06-07-2011 10:31 AM

I know it sounds weird. But I tested many times, the issue still cannot resolve when two consecutive date are same, but with different time. The code won't tell the difference.

If I use this logic to test:
"There are more than 1 test within 2 days from the same account and the total amount by these tests is above 4000 dollars. If so, the flag is 1, else the flag is 0".

proc sql ;
create table want (drop=_amount count_test) as
select distinct a.*,
count( b.test) as count_test,
sum(b.amount) as _amount,

case when calculated _amount gt 4000
and calculated count_test gt 1
then 1 else 0 end as flag

from temp as a ,temp as b
where a.account = b.account and b.date between a.date-2 and a.date
group by a.obs
;
quit;

The code will give Obs 13 flag as 1, instead it should be 0. Since there is no date earlier than obs 13 in account 38500. I think the problem caused by this issue is because of the code cannot different the time.

Thank you very much!

Ksharp · Posted 06-07-2011 10:11 PM

Oh.I understand what you mean.
Yes.My code does not care about time.So the last obs will also be 1.
If you want to follow this logic,you can use datetime format to distinct date and time exactly.

[pre]

data temp;
input obs account $ date : mmddyy10. time : time9. test id amount ;
format date mmddyy10. time time9.;
datetime= dhms(date,0,0,time);
format datetime datetime.;
datalines;
1 387000 6/10/2010 12:00:00 1 13.15 2000
2 387000 6/9/2010 12:00:00 1 13.15 3000
3 387000 6/8/2010 12:00:00 1 13.14 2000
4 387000 6/6/2010 11:00:00 1 13.15 2000
5 387000 6/5/2010 10:00:00 1 13.14 2000
6 387000 6/4/2010 9:00:00 1 13.13 1000
7 387000 6/4/2010 8:00:00 1 13.13 500
8 386000 5/9/2010 10:00:00 1 12.12 5000
9 386000 5/2/2010 9:00:00 1 12.14 2000
10 386000 5/1/2010 8:00:00 1 12.13 2000
11 385000 3/3/2010 10:00:00 1 11.13 5000
12 385000 3/1/2010 9:00:00 1 11.12 3000
13 385000 3/1/2010 8:00:00 1 11.11 3000
;
run;
proc sql ;
create table want(drop=_amount count_test) as
select distinct a.*,count( b.test) as count_test,sum(b.amount) as _amount,
case when calculated _amount gt 4000 and
calculated count_test gt 1 then 1 else 0 end as flag
from temp as a ,temp as b
where a.account = b.account and b.datetime between a.datetime-172800 and a.datetime
/*172800 is the number of seconds for two days.*/
group by a.obs
;
quit;
proc sort data=want;
by descending date descending time;
run;
[/pre]

Ksharp

NIUNIU · Posted 06-08-2011 12:33 AM

Hi Ksharp,

You really rocks. The code works great for me. However, I think I did not really master how to use Cartesian Product of sql. I really appreciate if you could help on how to resolve the following issue. The difficult part for me is how to select the at least three grade C and group them together. Do you have any good document to recommend to read on Cartesian Product of sql. Thank you so much!

The logic is "in the same account, the id is unique id and has at least 3 grade C within 2 days, then the flag is 1, else is 0. " For example, below data only obs 4 and 11 meets the requirement.

data temp;
input obs account $ date : mmddyy10. time : time9. test id amount grade $ flag ;
format date mmddyy10. time time9.;
datetime= dhms(date,0,0,time);
format datetime datetime.;
datalines;
1 387000 6/10/2010 12:00:00 1 13.15 2000 A 0
2 387000 6/9/2010 12:00:00 1 13.15 3000 A 0
3 387000 6/8/2010 12:00:00 1 13.14 2000 B 0
4 387000 6/6/2010 11:00:00 1 13.15 2000 C 1
5 387000 6/5/2010 10:00:00 1 13.14 2000 C 0
6 387000 6/4/2010 9:00:00 1 13.13 1000 B 0
7 387000 6/4/2010 8:00:00 1 13.13 500 C 0
8 386000 5/9/2010 10:00:00 1 12.12 5000 C 0
9 386000 5/2/2010 9:00:00 1 12.14 2000 A 0
10 386000 5/1/2010 8:00:00 1 12.13 2000 B 0
11 385000 3/3/2010 10:00:00 1 11.13 5000 C 1
12 385000 3/1/2010 9:00:00 1 11.12 3000 C 0
13 385000 3/1/2010 8:00:00 1 11.11 3000 C 0
;
run;

Ksharp · Posted 06-08-2011 03:02 AM

What does "the id is unique id" mean?
And there is a problem. As your logic ,there will not be any obs matched your codition.
For example:
Obs 4 's two days range is " 6/6/2010 11:00:00 " to "6/4/2010 11:00:00 " ,so Obs 6 and 7
will not be matched with Obs 4 ,because their datetime is out of range.
But you can change 172800 to meet what you want, It is up to you.

[pre]

data temp;
input obs account $ date : mmddyy10. time : time9. test id amount grade $ ;
format date mmddyy10. time time9.;
datetime= dhms(date,0,0,time);
format datetime datetime.;
datalines;
1 387000 6/10/2010 12:00:00 1 13.15 2000 A
2 387000 6/9/2010 12:00:00 1 13.15 3000 A
3 387000 6/8/2010 12:00:00 1 13.14 2000 B
4 387000 6/6/2010 11:00:00 1 13.15 2000 C
5 387000 6/5/2010 10:00:00 1 13.14 2000 C
6 387000 6/4/2010 9:00:00 1 13.13 1000 B
7 387000 6/4/2010 8:00:00 1 13.13 500 C
8 386000 5/9/2010 10:00:00 1 12.12 5000 C
9 386000 5/2/2010 9:00:00 1 12.14 2000 A
10 386000 5/1/2010 8:00:00 1 12.13 2000 B
11 385000 3/3/2010 10:00:00 1 11.13 5000 C
12 385000 3/1/2010 9:00:00 1 11.12 3000 C
13 385000 3/1/2010 8:00:00 1 11.11 3000 C
;
run;
proc sql ;
create table want(drop=sum_c) as
select distinct a.*,sum(case when b.grade eq 'C' then 1 else 0 end) as sum_c,
case when calculated sum_c ge 3 then 1 else 0 end as flag
from temp as a ,temp as b
where a.account = b.account and b.datetime between a.datetime-172800 and a.datetime
/*172800 is the number of seconds for two days.*/
group by a.obs
;
quit;
proc sort data=want;
by descending date descending time;
run;
[/pre]

Ksharp

Ksharp · Posted 06-08-2011 03:09 AM

Or Maybe you can start from the beginning of every day .
[pre]

data temp;
input obs account $ date : mmddyy10. time : time9. test id amount grade $ ;
format date mmddyy10. time time9.;
datetime= dhms(date,0,0,0);
format datetime datetime.;
datalines;
1 387000 6/10/2010 12:00:00 1 13.15 2000 A
2 387000 6/9/2010 12:00:00 1 13.15 3000 A
3 387000 6/8/2010 12:00:00 1 13.14 2000 B
4 387000 6/6/2010 11:00:00 1 13.15 2000 C
5 387000 6/5/2010 10:00:00 1 13.14 2000 C
6 387000 6/4/2010 9:00:00 1 13.13 1000 B
7 387000 6/4/2010 8:00:00 1 13.13 500 C
8 386000 5/9/2010 10:00:00 1 12.12 5000 C
9 386000 5/2/2010 9:00:00 1 12.14 2000 A
10 386000 5/1/2010 8:00:00 1 12.13 2000 B
11 385000 3/3/2010 10:00:00 1 11.13 5000 C
12 385000 3/1/2010 9:00:00 1 11.12 3000 C
13 385000 3/1/2010 8:00:00 1 11.11 3000 C
;
run;
proc sql ;
create table want(drop=sum_c) as
select distinct a.*,sum(case when b.grade eq 'C' then 1 else 0 end) as sum_c,
case when calculated sum_c ge 3 then 1 else 0 end as flag
from temp as a ,temp as b
where a.account = b.account and b.datetime between a.datetime-172800 and a.datetime
/*172800 is the number of seconds for two days.*/
group by a.obs
;
quit;
proc sort data=want;
by descending date descending time;
run;

[/pre]

Ksharp

NIUNIU · Posted 06-30-2011 12:04 AM

Hi Ksharp,

Thank you very much for all the help. However, these days I am still struggling with how to code the unique id part. See if I can explain more clearly.

The general logic is "The id is first time appears within -30 days and -7 days from the same account. If so, the flag is 1, else the flag is 0". I want to produce the last column flag.

obs	account	date	test	id	flag
1	384000	6/30/2010	1	13.15	1
2	384000	6/15/2010	1	13.14	1
3	384000	6/10/2010	1	13.14	1
4	384000	6/1/2010	1	13.13	0
5	386000	5/30/2010	1	12.12	0
6	386000	5/15/2010	1	12.12	1
7	386000	4/25/2010	1	12.13	0

For example, for obs 1, the id is 13.15, between 30 and 7 days, there is no same id is 13.15. So the flag is 1.

however, for obs 2, the id is 13.14, even though there is another 13.14 appears in obs 3, but the flag is still 1, since between 06/15 and 06/10, there is only 5 days difference, not more than 7 days difference. We will ignore the first 7 days observation. Therefore, this obs 3 will not count. And again, after we ignore the first 7 days, between -7 and -30 days, there is no same 13.14 id, so the flag will be 1.

Obs 4 will be 0, since this is no history to compare.

Obs 5 will be 0, since there is another id 12.12 in obs 6, and the date is between 7 days and 30 days difference.

I am struggling how to code to ignore the first 7 days, and only count between 7 days and 30 days range.

I really appreciate if you could help me how to resolve this puzzle.

Thank you very much!

Ksharp · Posted 06-30-2011 02:33 AM

OK.Cartesian Product of DataStep is more powerful than Cartesian Product of Sql. So I used it for your a little complicated problem.

data temp;
input account $ date : mmddyy10. test id;
format date mmddyy10.;
cards;
384000 6/30/2010 1 13.15
384000 6/15/2010 1 13.14     
384000 6/10/2010 1 13.14     
384000 6/1/2010  1 13.13
386000 5/30/2010 1 12.12
386000 5/15/2010 1 12.12
386000 4/25/2010 1 12.13
;
run;

data want;
 set temp;
 found=0; matched=0;flag=0;
 do i=1 to _nobs;
  set temp(drop=test rename=(account=_account date=_date id=_id))
      nobs=_nobs point=i;
  if account=_account and _date ge date-30 and _date le date-7
     and id=_id  then  found=1;
  if account=_account and _date ge date-30 and _date le date-7 then matched=1;
  if i=_nobs then do;
                    if not found and matched then flag=1;
                    output;
                  end;
 end;
 drop found matched _:;
run;

Ksharp

NIUNIU · Posted 06-30-2011 03:19 AM

Hi Ksharp,

Sincerely appreciate all your help. This is really something. Definitely learned a lot from you.

Thank you very much!!!

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!