Hello everyone,
As every model developer know, to do some validation tests, you should make bucketing in your data set and compare and do some tests between your development and performance data sets. At this point, to make this, I found a following link as below but I could not adjust that example for my sample;
https://communities.sas.com/t5/SAS-Procedures/Help-Using-Proc-Rank-With-Two-Datasets/td-p/124480
Based on above sample, I perfomed my sampe over SASHELP.CLASS, Let's pretend this is my development sample and I could bucketing as below;
proc rank data=sashelp.class out=mranks groups=3;
var age;
ranks rage;
run;
PROC TABULATE DATA=WORK.MRANKS;
CLASS rage / ORDER=UNFORMATTED MISSING;
CLASS Sex / ORDER=UNFORMATTED MISSING;
TABLE /* Row Dimension */rage,
/* Column Dimension */Sex*N ALL={LABEL="Total (ALL)"}* N ;
RUN;
Then I made simple changes and prepared a performance data set as below;
Data PerfClass;
Length Name $ 32 Gender $ 1 Age 8;
Infile Datalines dlm=" " Missover;
Input Name Gender Age;
Datalines;
Alfred M 15
Alice F 14
Barbara F 15
Carol F 15
Henry M 14
James M 13
Jane F 11
Janet F 15
Jeffrey M 12
John M 13
Joyce F 11
Judy F 12
Louise F 13
Mary F 12
Philip M 15
Robert M 11
Ronald M 14
Thomas M 11
William M 15
;
I would like to get Female and Male counts based on Train data set bucket ranges. How can I do it?
I am not sure whether my sample is correct or not but I hope I could made myself clear 🙂
@josh_wander, @MikeZdeb, @Linlin, @PaigeMiller
Thanks
Your link doesn't work.
It is also not clear to me how PROC RANK or PROC TABULATE has anything to do with separating data into train and validation data sets. Could you explain that further? Could we also stick with common terminology, "training" and "validation" rather than other terms?
Hello,
I fixed the link
I tried to tell via sample default data sets. The following data set is train data set and PROC RANK and PROC TABULATE provides me to see number of count for every group(bucket)
Data TRAIN;
Length Score 8 Target 8 P_Target 8;
Infile Datalines missover;
Input Score Target P_Target;
Datalines;
1200 1 0.7
1210 0 0.6
1220 1 0.8
1230 0 0.5
1240 1 0.8
1250 0 0.5
1260 1 0.8
1270 0 0.5
1280 1 0.9
1290 0 0.5
1300 0 0.4
1310 1 0.8
1320 0 0.5
1330 0 0.6
1340 1 0.8
1350 0 0.2
1360 0 0.3
1370 1 0.7
1380 0 0.4
1390 0 0.6
1400 1 0.7
1410 0 0.5
1420 0 0.4
1430 0 0.6
1440 1 0.8
1450 0 0.4
1460 0 0.6
1470 0 0.5
1480 1 0.8
1490 0 0.4
1500 0 0.6
;
Run;
proc rank data=TRAIN out=TRAIN_RANKED groups=3 ties=mean;
var SCORE;
ranks RSCORE;
run;
PROC TABULATE DATA=TRAIN_RANKED;
CLASS RSCORE / ORDER=UNFORMATTED MISSING;
CLASS TARGET / ORDER=UNFORMATTED MISSING;
TABLE /* Row Dimension */RSCORE,
/* Column Dimension */TARGET*N ALL={LABEL="Total (ALL)"}* N ;
RUN;
I also have following Validation data set and I want to apply Train data set ranges on Validation data set to see number of count Validation files between Train data set ranges. How can I do it?
Data PERF;
Length Score 8 Target 8 ;
Infile Datalines missover;
Input Score Target;
Datalines;
1250 1
1260 1
1270 1
1280 0
1290 1
1295 0
1296 1
1297 0
1298 1
1299 0
1300 0
1350 1
1360 0
1370 1
1380 1
1390 0
1395 0
1396 0
1397 0
1398 1
1400 1
1450 0
1460 0
1470 1
1480 1
1490 0
1495 1
1496 0
1497 1
1498 0
1600 1
;
Run;
My desired output as below;
Rank | Train0 | Train1 | TrainAll | Perf0onTrain | Perf1onTrain | PerfAllonTrain | MinTrain | MaxTrain |
0 | 5 | 5 | 10 | 4 | 6 | 10 | 0 | 1300 |
1 | 7 | 4 | 11 | 5 | 5 | 11 | 1300 | 1410 |
2 | 8 | 2 | 10 | 5 | 5 | 10 | 1410 | 1500 |
@PaigeMiller@Ksharp @Linlin @MikeZdeb @josh_wander
Thanks
If I am understanding this properly, I don't think PROC RANK is the way to go at all. I think you want to calculate the 33.333 percentile and the 66.667 percentile of your training data set. This can be done via PROC UNIVARIATE, and these percentiles can be output to a SAS data set.
Once the 33.333 and 66.667 percentiles are in a SAS data set, it is easy to apply them to the validation data set. This can be done via a series of IF-THEN statements in a DATA step.
I think you missed option
proc rank data=sashelp.class out=mranks groups=3 ties=low|high|dense ;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.