topic Re: Bucketing between Train and Performance Data Set in SAS Programming

Bucketing between Train and Performance Data Set

ertr — Fri, 14 Sep 2018 12:21:13 GMT

Hello everyone,

As every model developer know, to do some validation tests, you should make bucketing in your data set and compare and do some tests between your development and performance data sets. At this point, to make this, I found a following link as below but I could not adjust that example for my sample;

https://communities.sas.com/t5/SAS-Procedures/Help-Using-Proc-Rank-With-Two-Datasets/td-p/124480

Based on above sample, I perfomed my sampe over SASHELP.CLASS, Let's pretend this is my development sample and I could bucketing as below;

proc rank data=sashelp.class out=mranks groups=3;
var age;
ranks rage;
run;

PROC TABULATE DATA=WORK.MRANKS;	
	CLASS rage /	ORDER=UNFORMATTED MISSING;
	CLASS Sex /	ORDER=UNFORMATTED MISSING;
	TABLE /* Row Dimension */rage,
/* Column Dimension */Sex*N ALL={LABEL="Total (ALL)"}* N 		;
RUN;

Then I made simple changes and prepared a performance data set as below;

Data PerfClass;
Length Name $ 32 Gender $ 1 Age 8;
Infile Datalines dlm="	" Missover;
Input Name Gender Age;
Datalines;
Alfred	M	15
Alice	F	14
Barbara	F	15
Carol	F	15
Henry	M	14
James	M	13
Jane	F	11
Janet	F	15
Jeffrey	M	12
John	M	13
Joyce	F	11
Judy	F	12
Louise	F	13
Mary	F	12
Philip	M	15
Robert	M	11
Ronald	M	14
Thomas	M	11
William	M	15
;

I would like to get Female and Male counts based on Train data set bucket ranges. How can I do it?

I am not sure whether my sample is correct or not but I hope I could made myself clear 🙂

@josh_wander, @MikeZdeb, @Linlin, @PaigeMiller

Thanks

Re: Bucketing between Train and Performance Data Set

PaigeMiller — Fri, 14 Sep 2018 12:15:58 GMT

Your link doesn't work.

It is also not clear to me how PROC RANK or PROC TABULATE has anything to do with separating data into train and validation data sets. Could you explain that further? Could we also stick with common terminology, "training" and "validation" rather than other terms?

Re: Bucketing between Train and Performance Data Set

Ksharp — Fri, 14 Sep 2018 13:03:58 GMT

I think you missed option

proc rank data=sashelp.class out=mranks groups=3 ties=low|high|dense ;

Re: Bucketing between Train and Performance Data Set

ertr — Fri, 14 Sep 2018 18:39:41 GMT

Hello,

I fixed the link

I tried to tell via sample default data sets. The following data set is train data set and PROC RANK and PROC TABULATE provides me to see number of count for every group(bucket)

Data TRAIN;
Length Score 8 Target 8 P_Target 8;
Infile Datalines missover;
Input Score Target P_Target;
Datalines;
1200 1 0.7
1210 0 0.6
1220 1 0.8
1230 0 0.5
1240 1 0.8
1250 0 0.5
1260 1 0.8
1270 0 0.5
1280 1 0.9
1290 0 0.5
1300 0 0.4
1310 1 0.8
1320 0 0.5
1330 0 0.6
1340 1 0.8
1350 0 0.2
1360 0 0.3
1370 1 0.7
1380 0 0.4
1390 0 0.6
1400 1 0.7
1410 0 0.5
1420 0 0.4
1430 0 0.6
1440 1 0.8
1450 0 0.4
1460 0 0.6
1470 0 0.5
1480 1 0.8
1490 0 0.4
1500 0 0.6
;
Run;

proc rank data=TRAIN out=TRAIN_RANKED groups=3 ties=mean;
var SCORE;
ranks RSCORE;
run;

PROC TABULATE DATA=TRAIN_RANKED;	
	CLASS RSCORE /	ORDER=UNFORMATTED MISSING;
	CLASS TARGET /	ORDER=UNFORMATTED MISSING;
	TABLE /* Row Dimension */RSCORE,
/* Column Dimension */TARGET*N ALL={LABEL="Total (ALL)"}* N 		;
RUN;

I also have following Validation data set and I want to apply Train data set ranges on Validation data set to see number of count Validation files between Train data set ranges. How can I do it?

Data PERF;
Length Score 8 Target 8 ;
Infile Datalines missover;
Input Score Target;
Datalines;
1250 1
1260 1
1270 1
1280 0
1290 1
1295 0
1296 1
1297 0
1298 1
1299 0
1300 0
1350 1
1360 0
1370 1
1380 1
1390 0
1395 0
1396 0
1397 0
1398 1
1400 1
1450 0
1460 0
1470 1
1480 1
1490 0
1495 1
1496 0
1497 1
1498 0
1600 1
;
Run;

My desired output as below;

Rank	Train0	Train1	TrainAll	Perf0onTrain	Perf1onTrain	PerfAllonTrain	MinTrain	MaxTrain
0	5	5	10	4	6	10	0	1300
1	7	4	11	5	5	11	1300	1410
2	8	2	10	5	5	10	1410	1500

@PaigeMiller @Ksharp @Linlin @MikeZdeb @josh_wander

Thanks

Re: Bucketing between Train and Performance Data Set

PaigeMiller — Fri, 14 Sep 2018 19:42:41 GMT

If I am understanding this properly, I don't think PROC RANK is the way to go at all. I think you want to calculate the 33.333 percentile and the 66.667 percentile of your training data set. This can be done via PROC UNIVARIATE, and these percentiles can be output to a SAS data set.

Once the 33.333 and 66.667 percentiles are in a SAS data set, it is easy to apply them to the validation data set. This can be done via a series of IF-THEN statements in a DATA step.