Questions about rules selection Method in SAS Base

EC189QRW · Posted 09-26-2018 04:18 AM

I have 300 rules on my rule list, and many of them have not been updated for a very long time. I’d like to select a subset of the rules based on metrics from the confusion matrix, e.g. TPR (true positive rate) and PV+ (positive predictive value), computed from the transaction table. Maybe 60 of the 300 rules do 99% of the job in terms of TPR. Those 60 rules would then become my new pool of rules, and I would drop the remaining 240 and refresh the rule list. My basic logic is that the rule with the highest TPR on the transactions enters the pool first. For example, if R2 gives the highest TPR, it goes in first. After that, whichever remaining rule gives the highest combined TPR together with R2 joins the pool next, and so on.

Because there might be some overlap between the rules, we need to calculate the TPR each time and make the best choice in each round. The iteration would go on until the difference between the TPR of one pool (Pool A) and the TPR of the next (Pool B) is something like 0.01. I mean the TPR should converge at some point.

At present, I can create a table of TPR and PV+ for each rule from the transaction table. What I don’t know is how to dynamically build the sequence of rules and extract, as variables, the ones that give the largest TPR increase. I hope someone can help and give me some clues on how to tackle this. Thanks in advance.

Here is some sample transaction data. seq stands for sequence, gb stands for GOOD/BAD, and r1-r5 stand for Rule1-Rule5.

data trx;
input seq gb r1 r2 r3 r4 r5;
cards;
1 1 0 0 0 1 0
2 0 0 0 1 0 1
3 1 1 0 0 0 1
4 0 0 0 0 0 1
5 0 1 0 0 1 0
6 0 0 1 0 0 0
7 1 0 1 0 0 1
8 0 0 0 0 0 0
9 0 0 0 1 0 0
10 1 1 0 0 0 0
11 0 0 0 0 1 0
12 1 1 1 1 1 0
13 1 0 1 1 0 0
14 0 1 0 0 1 1
15 1 1 0 0 0 1
16 1 0 0 0 1 0
17 0 0 0 0 0 1
18 1 0 1 0 1 0
19 0 0 0 0 0 1
20 0 0 1 1 1 0
21 0 1 1 0 1 1
22 0 0 1 0 0 0
23 1 1 0 1 1 1
24 0 0 1 0 0 0
25 1 0 1 1 0 1
26 0 0 0 0 1 0
27 0 0 0 1 1 0
28 0 0 0 0 0 1
29 0 0 0 0 0 0
30 1 0 1 1 1 1
31 0 1 0 0 0 0
32 1 0 1 0 1 1
33 0 1 0 0 0 0
34 1 0 0 1 0 1
35 0 1 0 0 1 0
36 0 0 0 0 1 0
37 0 0 0 0 0 1
38 0 1 0 0 1 0
39 1 1 1 1 0 0
40 0 1 1 0 1 0
;

ballardw · Posted 09-26-2018 05:31 PM

I've read this three times now. I have to say I haven't a clue of what you actually want.

How do you get TPR (true positive rate) and PV+ (positive predictive value) from that data? You also say "So we need to calculate the TPR at each time". What indicates "each time" in that data set?

What do you want the final dataset to look like?

How do you apply any of the "rules"?

EC189QRW · Posted 09-27-2018 07:54 AM

Thank you for your reply, and sorry for the confusion. I left out some important information.

The sample dataset comes from credit card transactions. Some of them are fraudulent and some are normal. We implemented these rules to flag highly suspicious transactions. GB=1 means an actual fraudulent transaction, and GB=0 means a non-fraudulent one. R1 to R5 are the different rules we use to flag suspicious transactions. For instance, R1=1 means rule 1 labeled the transaction as fraudulent, and R1=0 means it labeled the transaction as genuine. So a confusion matrix can be used here to select effective rules. Our major concerns for these rules are TPR (true positive rate) and PV+ (positive predictive value). Using the cell labels in the table below, TPR = true positives / total actual positives = d/(c+d), and PV+ = true positives / total predicted positives = d/(b+d). As our pool of rules is almost full, I’d like to select a sequence of effective rules from the pool and implement only those in the system, which might give some relief to our server.

	Predicted: 1	Predicted: 0	Total
Actual: 1	d (true positive)	c (false negative)	c+d (actual positive)
Actual: 0	b (false positive)	a (true negative)	a+b (actual negative)
Total	b+d (predicted positive)	a+c (predicted negative)
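
For readers following along, here is a minimal sketch of how the per-rule TPR and PV+ defined above could be computed from the trx sample data in a single DATA step. The output dataset name rule_stats and the helper names tpc, fpc and npos are hypothetical, and the number of rules is hard-coded to 5 to match the sample. It is one possible approach, not necessarily how the poster built their table.

```sas
/* Sketch only: per-rule confusion-matrix counts from trx.          */
/* rule_stats, tpc, fpc and npos are illustrative names (assumed).  */
data rule_stats(keep=rule tp fp tpr pvplus);
  set trx end=last;
  array r{5} r1-r5;                  /* rule flags from trx          */
  array tpc{5} _temporary_ (5*0);    /* true positives per rule (d)  */
  array fpc{5} _temporary_ (5*0);    /* false positives per rule (b) */
  length rule $32;

  npos + (gb = 1);                   /* running count of frauds, c+d */
  do j = 1 to dim(r);
    if r{j} = 1 and gb = 1 then tpc{j} = tpc{j} + 1;
    else if r{j} = 1 and gb = 0 then fpc{j} = fpc{j} + 1;
  end;

  /* After the last row, write one output row per rule. */
  if last then do j = 1 to dim(r);
    rule = vname(r{j});
    tp = tpc{j};
    fp = fpc{j};
    if npos > 0 then tpr = tp / npos;            /* d/(c+d) */
    else tpr = .;
    if tp + fp > 0 then pvplus = tp / (tp + fp); /* d/(b+d) */
    else pvplus = .;
    output;
  end;
run;
```

Sorting rule_stats by descending tpr would then show which single rule is the first-round candidate.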

I’d like to get a rule list like r2,r1,r4,r3 as follows.

Obs	rule	ruleselected	accuracy	errorate	Tpr	Pvplus	Tnr	PvMinus
1	RuleX	r2,r1,r4,r3	0.55	0.45	1	0.45455	0.28	1

In the first round, R2 is selected because it has the highest TPR in the rule list, so R2 becomes part of the pool. In the second round, I need to calculate the TPR of R2+R1, R2+R3, R2+R4 and R2+R5, select the combination with the highest TPR, and add that second rule to the pool, for example R1. The process continues until adding a rule no longer increases the TPR of the pool, and then the iteration stops. I’m not sure if I have made my point clear. If you have any questions, please leave a comment. Thank you for your time, I really appreciate it.
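
The round-by-round selection described above amounts to a greedy forward selection, which could be sketched in Base SAS roughly as follows. This assumes that a pool flags a transaction when any of its rules fires, which is consistent with the example row above (TPR 1, PV+ 0.45455). The macro variables nrules, nobs and mingain, the output dataset greedy, and the array names are all hypothetical. The sketch also assumes the data fits in memory and contains at least one gb=1 row.

```sas
%let nrules  = 5;   /* rules r1-r&nrules (assumed naming)           */
%let mingain = 0;   /* stop when the TPR gain is <= this (assumed)  */

proc sql noprint;   /* row count, used to size the arrays           */
  select count(*) into :nobs from trx;
quit;

data greedy(keep=step rule ruleselected tpr pvplus gain);
  array r{&nrules} r1-r&nrules;
  array x{&nobs, &nrules} _temporary_;  /* rule flags per row       */
  array y{&nobs} _temporary_;           /* gb per row               */
  array hit{&nobs} _temporary_;         /* 1 if pool flags the row  */
  array inpool{&nrules} _temporary_;    /* 1 if rule is in the pool */
  length rule $32 ruleselected $200;

  /* Load the whole table into memory. */
  do until (last);
    set trx end=last;
    n + 1;
    y{n} = gb;
    hit{n} = 0;
    npos + (gb = 1);
    do j = 1 to &nrules;
      x{n, j} = r{j};
    end;
  end;
  do j = 1 to &nrules;
    inpool{j} = 0;
  end;

  cur_tp = 0;
  do step = 1 to &nrules;
    /* Try each remaining rule; keep the one with the most frauds   */
    /* caught by pool + rule. Ties go to the lower-numbered rule.   */
    best = 0;
    best_tp = cur_tp;
    do j = 1 to &nrules;
      if inpool{j} = 0 then do;
        tp_j = 0;
        do i = 1 to n;
          if y{i} = 1 and (hit{i} = 1 or x{i, j} = 1) then tp_j = tp_j + 1;
        end;
        if tp_j > best_tp then do;
          best = j;
          best_tp = tp_j;
        end;
      end;
    end;

    gain = (best_tp - cur_tp) / npos;
    if best = 0 or gain <= &mingain then leave;

    /* Add the winner and refresh the pool-level flags. */
    inpool{best} = 1;
    flagged = 0;
    do i = 1 to n;
      if x{i, best} = 1 then hit{i} = 1;
      flagged = flagged + hit{i};
    end;
    cur_tp = best_tp;
    rule = vname(r{best});
    ruleselected = catx(',', ruleselected, rule);
    tpr = cur_tp / npos;
    pvplus = cur_tp / flagged;
    output;
  end;
  stop;
run;
```

Each output row is one round, with ruleselected holding the pool so far. Traced by hand on the 40 sample rows with mingain set to 0, the selection order comes out as r2, r1, r4, r3. Setting mingain to 0.01 would apply the stopping threshold mentioned in the first post.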
