Contributor
Posts: 27

# Automating cross validation


Hi all,

The macro below performs 10-fold cross-validation. It computes the predicted probabilities for each test dataset it generates and appends the results to the dataset predicted.

I need to repeat the same process 500 times (currently it runs once). I was wondering if you could advise how to modify the macro so that the process repeats.

Note: the data and the code are below:

``````
data kyphosis;
set kyphosis;
theRandom = ranuni(86);
run;

*Then, divide the dataset into 10 groups based on the random number;
/*Open the dataset kRanked to verify that each observation is ranked 0 to 9 (10 even groups).*/;

proc rank data=kyphosis out = kRanked groups = 10;
var theRandom;
run;
``````
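The comment in the step above suggests opening kRanked to check the groups. A quicker check, sketched here on the assumption that the PROC RANK step has already run, is to tabulate the rank variable:

```sas
/* PROC RANK without a RANKS statement replaces theRandom with the group number (0-9) */
proc freq data=kRanked;
  tables theRandom / nocum nopercent;
run;
```

With 141 observations, each of the ten groups should hold 14 or 15 rows.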

/*STEP 2: Repeat the following 10 times:
i. Fit a logistic regression model on 9/10 of your data (the training dataset) and hold aside the other 1/10 as the test dataset.
ii. Use the fitted model to calculate the predicted probability of kyphosis=1 for each observation in the test dataset.
iii. Store these predicted probabilities in a new dataset, "predicted".*/

/*PS: Since we will later be appending observations onto the dataset ‘predicted’, we want to make
sure that there is not already a dataset called predicted.*/

``````
proc datasets library = work nodetails nolist;
delete predicted;
run;
quit;
``````

/*The MACRO*/

``````
%macro runit;
%do x = 0 %to 9; * asks SAS to repeat the steps 10 times (groups 0 to 9);
proc logistic data = kRanked outmodel = model&x.; *Fit the logistic model on 9/10 of the data, and
output the model into the dataset model0 (when x=0), model1 (when x=1), etc.;
model kyphosis (event="1") = y1;
where theRandom ne &x; * Omit 1/10 of the data (eg, when x=0, omit the observations where theRandom=0).;
run;
data training&x.;
set kRanked;
where theRandom ne &x;
run;
data test&x.; * Put the omitted data into the test dataset, called test0 (when x=0), test1 (when x=1), etc.;
set kRanked;
where theRandom = &x;
run;

proc logistic inmodel = model&x.; *Apply the logistic model to the test dataset and put
the predicted probabilities into a dataset predicted0, predicted1, etc.;
score data= test&x. out = predicted&x.;
run;
proc append base = predicted data = predicted&x.; *Keep adding the predicted values to a single dataset ‘predicted.’ ;
run;
%end;
%mend runit;
%runit;
``````

My dataset, kyphosis, is below:

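| y1 | kyphosis |
| --- | --- |
| 28 | 0 |
| 15.5 | 0 |
| 8.2 | 0 |
| 3.4 | 0 |
| 17.3 | 0 |
| 15.2 | 0 |
| 32.9 | 0 |
| 11.1 | 0 |
| 87.5 | 0 |
| 16.2 | 0 |
| 107.9 | 0 |
| 5.7 | 0 |
| 25.6 | 0 |
| 31.2 | 0 |
| 21.6 | 0 |
| 55.6 | 0 |
| 8.8 | 0 |
| 6.5 | 0 |
| 22.1 | 0 |
| 14.4 | 0 |
| 44.2 | 0 |
| 3.7 | 0 |
| 7.8 | 0 |
| 8.9 | 0 |
| 18 | 0 |
| 6.5 | 0 |
| 4.9 | 0 |
| 10.4 | 0 |
| 5 | 0 |
| 5.3 | 0 |
| 6.5 | 0 |
| 6.9 | 0 |
| 8.2 | 0 |
| 21.8 | 0 |
| 6.6 | 0 |
| 7.6 | 0 |
| 15.4 | 0 |
| 59.2 | 0 |
| 5.1 | 0 |
| 10 | 0 |
| 5.3 | 0 |
| 32.6 | 0 |
| 4.6 | 0 |
| 6.9 | 0 |
| 4 | 0 |
| 3.65 | 0 |
| 7.8 | 0 |
| 32.5 | 0 |
| 11.5 | 0 |
| 4 | 0 |
| 10.2 | 0 |
| 2.4 | 1 |
| 719 | 1 |
| 2106.667 | 1 |
| 24000 | 1 |
| 1715 | 1 |
| 3.6 | 1 |
| 521.5 | 1 |
| 1600 | 1 |
| 454 | 1 |
| 109.7 | 1 |
| 23.7 | 1 |
| 464 | 1 |
| 9810 | 1 |
| 255 | 1 |
| 58.7 | 1 |
| 225 | 1 |
| 90.1 | 1 |
| 50 | 1 |
| 5.6 | 1 |
| 4070 | 1 |
| 592 | 1 |
| 28.6 | 1 |
| 6160 | 1 |
| 1090 | 1 |
| 10.4 | 1 |
| 27.3 | 1 |
| 162 | 1 |
| 3560 | 1 |
| 14.7 | 1 |
| 83.3 | 1 |
| 336 | 1 |
| 55.7 | 1 |
| 1520 | 1 |
| 3.9 | 1 |
| 5.8 | 1 |
| 8.45 | 1 |
| 361 | 1 |
| 369 | 1 |
| 8230 | 1 |
| 39.3 | 1 |
| 43.5 | 1 |
| 361 | 1 |
| 12.8 | 1 |
| 18 | 1 |
| 9590 | 1 |
| 555 | 1 |
| 60.2 | 1 |
| 21.8 | 1 |
| 900 | 1 |
| 6.6 | 1 |
| 239 | 1 |
| 3100 | 1 |
| 3275 | 1 |
| 682 | 1 |
| 85.4 | 1 |
| 10290 | 1 |
| 770 | 1 |
| 247.6 | 1 |
| 12320 | 1 |
| 113.1 | 1 |
| 1079 | 1 |
| 45.6 | 1 |
| 1630 | 1 |
| 79.4 | 1 |
| 508 | 1 |
| 3190 | 1 |
| 542 | 1 |
| 1021 | 1 |
| 235 | 1 |
| 251 | 1 |
| 3160 | 1 |
| 479 | 1 |
| 222 | 1 |
| 15.7 | 1 |
| 2540 | 1 |
| 11630 | 1 |
| 1810 | 1 |
| 6.9 | 1 |
| 4.1 | 1 |
| 15.6 | 1 |
| 9820 | 1 |
| 1490 | 1 |
| 15.7 | 1 |
| 45.8 | 1 |
| 7.8 | 1 |
| 12.8 | 1 |
| 100.5333 | 1 |
| 227 | 1 |
| 70.9 | 1 |
| 2500 | 1 |
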
PROC Star
Posts: 8,151

## Re: Automating cross validation

Which process do you want to repeat 500 times? Run the macro 500 times or do the entire process 500 times?

Since you initially select a random sequence based on a specific seed, do you want to do that same thing 500 times, or replace it with a different seed each time?

I presume that you want to generate 500 different output files, but you haven't specified what you want.

Art, CEO, AnalystFinder.com

Contributor
Posts: 27

## Re: Automating cross validation

Sorry for not being clear.

Yes - I want to generate 500 different output files.

So to do that I need to run the entire process 50 times, i.e., each run generates 10 different output files.
PROC Star
Posts: 8,151

## Re: Automating cross validation

You could just add an outer loop within your current macro. However, if you use the same seed (like you did), all 50 replications will be identical. As such, I moved the random selection into the macro, and used a seed of 0.

Also, I presume you only want 50 files and that the ten files created each time can be overwritten.

Check to see if the following does what you want:

```
data kyphosis;
infile cards dlm='09'x;
input y1	kyphosis;
cards;
28	0
15.5	0
8.2	0
3.4	0
17.3	0
15.2	0
32.9	0
11.1	0
87.5	0
16.2	0
107.9	0
5.7	0
25.6	0
31.2	0
21.6	0
55.6	0
8.8	0
6.5	0
22.1	0
14.4	0
44.2	0
3.7	0
7.8	0
8.9	0
18	0
6.5	0
4.9	0
10.4	0
5	0
5.3	0
6.5	0
6.9	0
8.2	0
21.8	0
6.6	0
7.6	0
15.4	0
59.2	0
5.1	0
10	0
5.3	0
32.6	0
4.6	0
6.9	0
4	0
3.65	0
7.8	0
32.5	0
11.5	0
4	0
10.2	0
2.4	1
719	1
2106.667	1
24000	1
1715	1
3.6	1
521.5	1
1600	1
454	1
109.7	1
23.7	1
464	1
9810	1
255	1
58.7	1
225	1
90.1	1
50	1
5.6	1
4070	1
592	1
28.6	1
6160	1
1090	1
10.4	1
27.3	1
162	1
3560	1
14.7	1
83.3	1
336	1
55.7	1
1520	1
3.9	1
5.8	1
8.45	1
361	1
369	1
8230	1
39.3	1
43.5	1
361	1
12.8	1
18	1
9590	1
555	1
60.2	1
21.8	1
900	1
6.6	1
239	1
3100	1
3275	1
682	1
85.4	1
10290	1
770	1
247.6	1
12320	1
113.1	1
1079	1
45.6	1
1630	1
79.4	1
508	1
3190	1
542	1
1021	1
235	1
251	1
3160	1
479	1
222	1
15.7	1
2540	1
11630	1
1810	1
6.9	1
4.1	1
15.6	1
9820	1
1490	1
15.7	1
45.8	1
7.8	1
12.8	1
100.5333	1
227	1
70.9	1
2500	1
;

/*STEP 2: Repeat the following 50*10 times:
i. Fit a logistic regression model on 9/10 of your data (the training dataset) and hold aside the other 1/10 as the test dataset.
ii. Use the fitted model to calculate the predicted probability of kyphosis=1 for each observation in the test dataset.
iii. Store these predicted probabilities in a new dataset, "predicted".*/

/*The MACRO*/
%macro runit;
%do i=1 %to 50;
data kyphosis;
set kyphosis;
theRandom = ranuni(0);
run;

/*Then, divide the dataset into 10 groups based on the random number*/
/*Open the dataset kRanked to verify that each observation is ranked 0 to 9 (10 even groups).*/

proc rank data=kyphosis out = kRanked groups = 10;
var theRandom;
run;
/*PS: Since we will later be appending observations onto the dataset ‘predicted’, we want to make
sure that there is not already a dataset called predicted.*/

proc datasets library = work nodetails nolist;
delete predicted&i.;
run;
quit;

%do x = 0 %to 9; * asks SAS to repeat the steps 10 times;

proc logistic data = kRanked outmodel = model&x.; *Fit the logistic model on 9/10 of the data, and
output the model into the dataset model0 (when x=0), model1 (when x=1), etc.;
model kyphosis (event="1") = y1;
where theRandom ne &x; * Omit 1/10 of the data (eg, when x=0, omit the observations where theRandom=0).;
run;

data training&x.;
set kRanked;
where theRandom ne &x;
run;

data test&x.; * Put the omitted data into the test dataset, called test0 (when x=0), test1 (when x=1), etc.;
set kRanked;
where theRandom = &x;
run;

proc logistic inmodel = model&x.; *Apply the logistic model to the test dataset and put
the predicted probabilities into a dataset predicted0, predicted1, etc.;

score data= test&x. out = _predicted&x.;
run;
proc append base = predicted&i. data = _predicted&x.; *Keep adding the predicted values to a single dataset ‘predicted.’ ;
run;
%end;
%end;
%mend runit;

%runit;
```
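One caveat with `ranuni(0)`: a seed of 0 takes its starting value from the system clock, so the 50 partitions differ from one another but cannot be reproduced on a later run. If reproducibility matters, a minimal sketch along the following lines may help. The macro name `assign_folds` and the parameter `baseseed` are hypothetical and not part of the code above; the idea is simply to derive a distinct, fixed seed for each replication.

```sas
/* Hypothetical helper: assign fold numbers 0-9 for replication &i
   using a fixed, replication-specific seed so reruns give the same split. */
%macro assign_folds(i=, baseseed=86);
  data kyphosis;
    set kyphosis;
    /* a positive seed gives the same stream on every run;
       &baseseed + &i makes the seed differ between replications */
    theRandom = ranuni(%eval(&baseseed + &i));
  run;

  proc rank data=kyphosis out=kRanked groups=10;
    var theRandom;
  run;
%mend assign_folds;

/* Example: the partition for replication 1 */
%assign_folds(i=1);
```

Inside the outer loop, a call such as `%assign_folds(i=&i)` could stand in for the data step and PROC RANK.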


Contributor
Posts: 27

## Re: Automating cross validation

Many thanks. I really appreciate your help.

In my case I want to keep all 500 files: the ten files created in each replication need to be kept for further analysis.

I was wondering if you could advise me how to keep those 500 files?

Many thanks
PROC Star
Posts: 8,151

## Re: Automating cross validation

The following (not tested though) keeps all 500 files, as well as all 500 models and training datasets:

```
data kyphosis;
infile cards dlm='09'x;
input y1	kyphosis;
cards;
28	0
15.5	0
8.2	0
3.4	0
17.3	0
15.2	0
32.9	0
11.1	0
87.5	0
16.2	0
107.9	0
5.7	0
25.6	0
31.2	0
21.6	0
55.6	0
8.8	0
6.5	0
22.1	0
14.4	0
44.2	0
3.7	0
7.8	0
8.9	0
18	0
6.5	0
4.9	0
10.4	0
5	0
5.3	0
6.5	0
6.9	0
8.2	0
21.8	0
6.6	0
7.6	0
15.4	0
59.2	0
5.1	0
10	0
5.3	0
32.6	0
4.6	0
6.9	0
4	0
3.65	0
7.8	0
32.5	0
11.5	0
4	0
10.2	0
2.4	1
719	1
2106.667	1
24000	1
1715	1
3.6	1
521.5	1
1600	1
454	1
109.7	1
23.7	1
464	1
9810	1
255	1
58.7	1
225	1
90.1	1
50	1
5.6	1
4070	1
592	1
28.6	1
6160	1
1090	1
10.4	1
27.3	1
162	1
3560	1
14.7	1
83.3	1
336	1
55.7	1
1520	1
3.9	1
5.8	1
8.45	1
361	1
369	1
8230	1
39.3	1
43.5	1
361	1
12.8	1
18	1
9590	1
555	1
60.2	1
21.8	1
900	1
6.6	1
239	1
3100	1
3275	1
682	1
85.4	1
10290	1
770	1
247.6	1
12320	1
113.1	1
1079	1
45.6	1
1630	1
79.4	1
508	1
3190	1
542	1
1021	1
235	1
251	1
3160	1
479	1
222	1
15.7	1
2540	1
11630	1
1810	1
6.9	1
4.1	1
15.6	1
9820	1
1490	1
15.7	1
45.8	1
7.8	1
12.8	1
100.5333	1
227	1
70.9	1
2500	1
;

/*STEP 2: Repeat the following 50*10 times:
i. Fit a logistic regression model on 9/10 of your data (the training dataset) and hold aside the other 1/10 as the test dataset.
ii. Use the fitted model to calculate the predicted probability of kyphosis=1 for each observation in the test dataset.
iii. Store these predicted probabilities in a new dataset, "predicted".*/

/*The MACRO*/
%macro runit;
%let counter=0; /* incremented before each use, so the files are numbered 1 to 500 */
%do i=1 %to 50;
data kyphosis;
set kyphosis;
theRandom = ranuni(0);
run;

/*Then, divide the dataset into 10 groups based on the random number*/
/*Open the dataset kRanked to verify that each observation is ranked 0 to 9 (10 even groups).*/

proc rank data=kyphosis out = kRanked groups = 10;
var theRandom;
run;
/*PS: Since we will later be appending observations onto the dataset ‘predicted’, we want to make
sure that there is not already a dataset called predicted.*/

proc datasets library = work nodetails nolist;
delete predicted&i.;
run;
quit;

%do x = 0 %to 9; * asks SAS to repeat the steps 10 times;

%let counter=%eval(&counter+1);

proc logistic data = kRanked outmodel = model&counter.; *Fit the logistic model on 9/10 of the data, and
output the model into a separately numbered dataset (one per fold and replication);
model kyphosis (event="1") = y1;
where theRandom ne &x; * Omit 1/10 of the data (eg, when x=0, omit the observations where theRandom=0).;
run;

data training&counter.;
set kRanked;
where theRandom ne &x;
run;

data test&counter.; * Put the omitted data into the test dataset, numbered by the same counter as the model;
set kRanked;
where theRandom = &x;
run;

proc logistic inmodel = model&counter.; *Apply the logistic model to the test dataset and put
the predicted probabilities into a dataset numbered by the same counter;

score data= test&counter. out = _predicted&counter.;
run;
proc append base = predicted&i. data = _predicted&counter.; *Keep adding the predicted values to a single dataset ‘predicted.’ ;
run;
%end;
%end;
%mend runit;

%runit;
```
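For the further analysis mentioned earlier, it may be more convenient to stack the 50 `predicted` datasets into one. This is a minimal sketch, assuming the macro above has run and created predicted1 through predicted50 in WORK. The names `all_predicted`, `replicate`, and `_src` are hypothetical.

```sas
data all_predicted;
  /* the numbered range list reads predicted1, predicted2, ..., predicted50 in turn;
     INDSNAME= holds the name of the dataset each row came from, e.g. WORK.PREDICTED7 */
  set predicted1-predicted50 indsname=_src;
  /* keep only the digits of the member name to recover the replication number */
  replicate = input(compress(scan(_src, 2, '.'), , 'kd'), 8.);
run;
```

The INDSNAME= variable is not written to the output dataset, which is why its value is copied into `replicate`.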


Contributor
Posts: 27

## Re: Automating cross validation

Many thanks. It does indeed give me what I wanted.

Contributor
Posts: 27

## Re: Automating cross validation

Hi

I need to calculate sensitivity and specificity using each observed value of y1 (2.4 to 24000) as the cut-off, and append the results to a dataset, as shown below.

I was wondering if you could help update my code below to produce the results I need.

DATA kyphosis;

set kyphosis;

** Create a binary variable;

if y1 <= 39 then y11=0;

else y11=1;

run;

* Calculate Sensitivity and Specificity with 39.2 as a cut;

proc freq data = kyphosis order = formatted;

tables kyphosis * y11 / nocol nopercent;

run;

HERE ARE THE RESULTS OF SENSITIVITY AND SPECIFICITY FOR Y1 = 39.2

The FREQ Procedure

Table of kyphosis by y11 (each cell shows Frequency, with Row Pct in parentheses)

| kyphosis | y11 = 0 | y11 = 1 | Total |
| --- | --- | --- | --- |
| 0 | 46 (90.20) | 5 (9.80) | 51 |
| 1 | 22 (24.44) | 68 (75.56) | 90 |
| Total | 68 | 73 | 141 |

APPEND THE RESULTS AS FOLLOWS:

| Y1 | Prevalence | Sensitivity | Specificity | PPV (Positive Predictive Value) | NPV (Negative Predictive Value) adjusted |
| --- | --- | --- | --- | --- | --- |
| 39.2 | 0.07 | 75.56 | 90.20 | 9.80 | 24.44 |
| 2.4 | 0.07 | | | | |

I need to repeat this process for each value of y1 (2.4 to 24000) and append the results to a dataset.

I was wondering if you could help update my code to do this.

Thanks,

Contributor
Posts: 27

## Re: Automating cross validation

Hi art297,

Using the same dataset as above (kyphosis):

I need to calculate sensitivity and specificity using each observed value of y1 (2.4 to 24000) as the cut-off, and append the results to a dataset, as shown below.

I was wondering if you could help update my code below to produce the results I need.

Here is the code I have used:

DATA kyphosis;

set kyphosis;

** Create a binary variable;

if y1 <= 39 then y11=0;

else y11=1;

run;

* Calculate Sensitivity and Specificity with 39.2 as a cut;

proc freq data = kyphosis order = formatted;

tables kyphosis * y11 / nocol nopercent;

run;

HERE ARE THE RESULTS OF SENSITIVITY AND SPECIFICITY FOR Y1 = 39.2

The FREQ Procedure

Table of kyphosis by y11 (each cell shows Frequency, with Row Pct in parentheses)

| kyphosis | y11 = 0 | y11 = 1 | Total |
| --- | --- | --- | --- |
| 0 | 46 (90.20) | 5 (9.80) | 51 |
| 1 | 22 (24.44) | 68 (75.56) | 90 |
| Total | 68 | 73 | 141 |

APPEND THE RESULTS AS FOLLOWS:

Y1 = 39.2

Sensitivity = 75.56

Specificity= 90.20

PPV ( Positive Predictive Value) = 9.80

NPV = 24.44

I need to repeat this process for each value of y1 (2.4 to 24000) and append the results to a dataset.

I was wondering if you could help update my code to do this.

Thanks,
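
One possible way to get a row per cut-off without rerunning PROC FREQ for every value is a single PROC SQL step that pairs each distinct y1 value with the full dataset. The sketch below is only an illustration under stated assumptions: it swaps PROC SQL in for PROC FREQ, the output name `cutoff_stats` and its column names are hypothetical, and it applies the same rule as the code above (y1 at or below the cut counts as test-negative). It computes PPV and NPV under their standard definitions, TP/(TP+FP) and TN/(TN+FN), which are column percentages of the 2x2 table and therefore differ from the row percentages 9.80 and 24.44 listed above.

```sas
proc sql;
  create table cutoff_stats as
  select c.cut as y1_cut,
         /* true positives / all kyphosis=1 */
         100 * sum(k.kyphosis = 1 and k.y1 >  c.cut) / sum(k.kyphosis = 1) as sensitivity,
         /* true negatives / all kyphosis=0 */
         100 * sum(k.kyphosis = 0 and k.y1 <= c.cut) / sum(k.kyphosis = 0) as specificity,
         /* true positives / all test-positives */
         100 * sum(k.kyphosis = 1 and k.y1 >  c.cut) / sum(k.y1 >  c.cut) as ppv,
         /* true negatives / all test-negatives */
         100 * sum(k.kyphosis = 0 and k.y1 <= c.cut) / sum(k.y1 <= c.cut) as npv
  /* every distinct y1 value (hypothetical alias: cut) is paired with every observation */
  from (select distinct y1 as cut from kyphosis) as c,
       kyphosis as k
  group by c.cut
  order by y1_cut;
quit;
```

A logical expression in PROC SQL evaluates to 1 or 0, so `sum()` over it counts the matching rows. At the largest y1 value no observation is test-positive, so the PPV denominator is zero and that row's PPV comes out as a missing value.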
Posts: 1,312

## Re: Automating cross validation

I don't know about the rest of the program, but I would skip the proc rank and use the rand('table',...) function to assign the ten groups at random (it numbers them 1 to 10 rather than 0 to 9). Here's how:

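``````
data kyphosis (drop=_:);
  set kyphosis nobs=nrecs;
  /* needed{g}: number of observations still to be assigned to group g */
  array needed {10} _temporary_;
  retain _nremain;
  if _n_=1 then do;
    _nremain=nrecs;
    do _col=1 to 10; needed{_col}=ceil(nrecs/10); end;
  end;

  call streaminit(01982066);
  /* each group's probability is its remaining quota over the remaining rows,
     capped at 1 as in the 500-variable version below */
  array prb{10} _temporary_ ;
  do _col=1 to 10;  prb{_col}=min(1,needed{_col}/_nremain); end;

  rnd=rand('table',of prb{*});
  needed{rnd}=needed{rnd}-1;
  _nremain=_nremain-1;
run;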

``````

Moreover, you can do it for 500 variables at once:

``````
data kyphosis (drop=_:);
  set kyphosis nobs=nrecs;
  array needed {500,10} _temporary_;
  retain _nremain;
  if _n_=1 then do;
    _nremain=nrecs;
    do _row=1 to 500;
      do _col=1 to 10; needed{_row,_col}=ceil(nrecs/10); end;
    end;
  end;
  call streaminit(01982066);
  array _prb{10} _temporary_;
  array rnd{500};
  do _row=1 to 500;
    do _col=1 to 10;  _prb{_col}=min(1,needed{_row,_col}/_nremain); end;
    rnd{_row}=rand('table',of _prb{*});
    needed{_row,rnd{_row}}=needed{_row,rnd{_row}}-1;
  end;
  _nremain=_nremain-1;
run;
``````

This will eliminate 499 data steps and 500 proc ranks at the beginning of your script.

Contributor
Posts: 27

## Re: Automating cross validation

Many thanks
