Automating cross validation

Contributor
Posts: 27

Hi all,

 

The macro below performs 10-fold cross-validation. It computes the predicted probabilities for each test dataset it generates and appends the results to the dataset predicted.

 

I need to repeat the same process 500 times (currently it runs once). I was wondering if you could advise how to modify the macro so that it repeats the process.

 

Note: the data and code are below:

 

 


data kyphosis;
set kyphosis;
theRandom = ranuni(86);
run;

*Then, divide the dataset into 10 groups based on the random number;
/*Open the dataset kRanked to verify that each observation is ranked 0 to 9 (10 even groups).*/;


proc rank data=kyphosis out = kRanked groups = 10;
var theRandom;
run;
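
/* Optional check, not part of the original post: tabulate the fold sizes.
   Each value of theRandom (0 to 9) should appear about the same number of times. */
proc freq data = kRanked;
tables theRandom / nocum nopercent;
run;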

/*STEP 2: Repeat the following 10 times:
i. Fit a logistic regression model on 9/10 of your data (the training dataset) and hold aside the other 1/10 as the test dataset.
ii. Use the fitted model to calculate the predicted probability of kyphosis=1 for each observation in the test dataset.
iii. Store these predicted probabilities in a new dataset, "predicted".*/

/*PS: Since we will later be appending observations onto the dataset ‘predicted’, we want to make
sure that there is not already a dataset called predicted.*/

proc datasets library = work nodetails nolist;
delete predicted;
run;
quit;

/*The MACRO*/

%macro runit;
%do x = 0 %to 9; * asks SAS to repeat the steps 10 times (groups = 10 in PROC RANK gives ranks 0 to 9);
proc logistic data = kRanked outmodel = model&x.; *Fit the logistic model on 9/10 of the data, and
output the model into the dataset model0 (when x=0), model1 (when x=1), etc.;
model kyphosis (event="1") = y1;
where theRandom ne &x; * Omit 1/10 of the data (eg, when x=0, omit the observations where theRandom=0).;
run;
data training&x.;
set kRanked;
where theRandom ne &x;
run;
data test&x.; * Put the omitted data into the test dataset, called test0 (when x=0), test1 (when x=1), etc.;
set kRanked;
where theRandom = &x;
run;

proc logistic inmodel = model&x.; *Apply the logistic model to the test dataset and put
the predicted probabilities into a dataset predicted0, predicted1, etc.;
score data= test&x. out = predicted&x.;
run;
proc append base = predicted data = predicted&x.; *Keep adding the predicted values to a single dataset ‘predicted.’ ;
run;
%end;
%mend runit;
%runit;

 

My dataset, kyphosis, is below:

 

y1 kyphosis
28 0
15.5 0
8.2 0
3.4 0
17.3 0
15.2 0
32.9 0
11.1 0
87.5 0
16.2 0
107.9 0
5.7 0
25.6 0
31.2 0
21.6 0
55.6 0
8.8 0
6.5 0
22.1 0
14.4 0
44.2 0
3.7 0
7.8 0
8.9 0
18 0
6.5 0
4.9 0
10.4 0
5 0
5.3 0
6.5 0
6.9 0
8.2 0
21.8 0
6.6 0
7.6 0
15.4 0
59.2 0
5.1 0
10 0
5.3 0
32.6 0
4.6 0
6.9 0
4 0
3.65 0
7.8 0
32.5 0
11.5 0
4 0
10.2 0
2.4 1
719 1
2106.667 1
24000 1
1715 1
3.6 1
521.5 1
1600 1
454 1
109.7 1
23.7 1
464 1
9810 1
255 1
58.7 1
225 1
90.1 1
50 1
5.6 1
4070 1
592 1
28.6 1
6160 1
1090 1
10.4 1
27.3 1
162 1
3560 1
14.7 1
83.3 1
336 1
55.7 1
1520 1
3.9 1
5.8 1
8.45 1
361 1
369 1
8230 1
39.3 1
43.5 1
361 1
12.8 1
18 1
9590 1
555 1
60.2 1
21.8 1
900 1
6.6 1
239 1
3100 1
3275 1
682 1
85.4 1
10290 1
770 1
247.6 1
12320 1
113.1 1
1079 1
45.6 1
1630 1
79.4 1
508 1
3190 1
542 1
1021 1
235 1
251 1
3160 1
479 1
222 1
15.7 1
2540 1
11630 1
1810 1
6.9 1
4.1 1
15.6 1
9820 1
1490 1
15.7 1
45.8 1
7.8 1
12.8 1
100.5333 1
227 1
70.9 1
2500 1
PROC Star
Posts: 8,151

Re: Top 10 SAS Functions

The title of your question is totally misleading.

 

Which process do you want to repeat 500 times? Run the macro 500 times or do the entire process 500 times?

 

Since you initially select a random sequence based on a specific seed, do you want to do that same thing 500 times, or replace it with a different seed each time?

 

I presume that you want to generate 500 different output files, but you haven't specified what you want.

 

Art, CEO, AnalystFinder.com

 

Contributor
Posts: 27

Re: Top 10 SAS Functions

Sorry for not being clear.

Yes - I want to generate 500 different output files.

So to do that I need to run the entire process 50 times, i.e., each run generates 10 different output files.
PROC Star
Posts: 8,151

Re: Top 10 SAS Functions

You could just add an outer loop within your current macro. However, if you use the same seed (like you did), all 50 replications will be identical. As such, I moved the random selection into the macro and used a seed of 0, which makes ranuni seed itself from the system clock so that each replication differs.

 

Also, I presume you only want 50 files and that the ten files created each time can be overwritten.

 

Check to see if the following does what you want:

data kyphosis;
  infile cards dlm='09'x;
  input y1	kyphosis;
  cards;
28	0
15.5	0
8.2	0
3.4	0
17.3	0
15.2	0
32.9	0
11.1	0
87.5	0
16.2	0
107.9	0
5.7	0
25.6	0
31.2	0
21.6	0
55.6	0
8.8	0
6.5	0
22.1	0
14.4	0
44.2	0
3.7	0
7.8	0
8.9	0
18	0
6.5	0
4.9	0
10.4	0
5	0
5.3	0
6.5	0
6.9	0
8.2	0
21.8	0
6.6	0
7.6	0
15.4	0
59.2	0
5.1	0
10	0
5.3	0
32.6	0
4.6	0
6.9	0
4	0
3.65	0
7.8	0
32.5	0
11.5	0
4	0
10.2	0
2.4	1
719	1
2106.667	1
24000	1
1715	1
3.6	1
521.5	1
1600	1
454	1
109.7	1
23.7	1
464	1
9810	1
255	1
58.7	1
225	1
90.1	1
50	1
5.6	1
4070	1
592	1
28.6	1
6160	1
1090	1
10.4	1
27.3	1
162	1
3560	1
14.7	1
83.3	1
336	1
55.7	1
1520	1
3.9	1
5.8	1
8.45	1
361	1
369	1
8230	1
39.3	1
43.5	1
361	1
12.8	1
18	1
9590	1
555	1
60.2	1
21.8	1
900	1
6.6	1
239	1
3100	1
3275	1
682	1
85.4	1
10290	1
770	1
247.6	1
12320	1
113.1	1
1079	1
45.6	1
1630	1
79.4	1
508	1
3190	1
542	1
1021	1
235	1
251	1
3160	1
479	1
222	1
15.7	1
2540	1
11630	1
1810	1
6.9	1
4.1	1
15.6	1
9820	1
1490	1
15.7	1
45.8	1
7.8	1
12.8	1
100.5333	1
227	1
70.9	1
2500	1
;

/*STEP 2: Repeat the following 50*10 times:
i. Fit a logistic regression model on 9/10 of your data (the training dataset) and hold aside the other 1/10 as the test dataset.
ii. Use the fitted model to calculate the predicted probability of kyphosis=1 for each observation in the test dataset.
iii. Store these predicted probabilities in a new dataset, "predicted".*/


/*The MACRO*/
%macro runit;
  %do i=1 %to 50;
    data kyphosis;
      set kyphosis;
      theRandom = ranuni(0);
    run;

/*Then, divide the dataset into 10 groups based on the random number.*/
/*Open the dataset kRanked to verify that each observation is ranked 0 to 9 (10 even groups).*/

proc rank data=kyphosis out = kRanked groups = 10;
var theRandom;
run;
  /*PS: Since we will later be appending observations onto the dataset ‘predicted’, we want to make
  sure that there is not already a dataset called predicted.*/

  proc datasets library = work nodetails nolist;
    delete predicted&i.;
  run;
  quit;

%do x = 0 %to 9; * asks SAS to repeat the steps 10 times;

proc logistic data = kRanked outmodel = model&x.; *Fit the logistic model on 9/10 of the data, and
output the model into the dataset model0 (when x=0), model1 (when x=1), etc.;
model kyphosis (event="1") = y1;
where theRandom ne &x; * Omit 1/10 of the data (eg, when x=0, omit the observations where theRandom=0).;
run;

data training&x.;
set kRanked;
where theRandom ne &x;
run;

data test&x.; * Put the omitted data into the test dataset, called test0 (when x=0), test1 (when x=1), etc.;
set kRanked;
where theRandom = &x;
run;


proc logistic inmodel = model&x.; *Apply the logistic model to the test dataset and put
the predicted probabilities into a dataset predicted0, predicted1, etc.;

score data= test&x. out = _predicted&x.;
run;
proc append base = predicted&i. data = _predicted&x.; *Keep adding the predicted values to a single dataset ‘predicted.’ ;
run;
%end;
%end;
%mend runit;

%runit;

Art, CEO, AnalystFinder.com

 

 

 

Contributor
Posts: 27

Re: Top 10 SAS Functions

Many thanks. I really appreciate your help.

In my case I want to keep all 500 files, so the ten files created on each pass need to be kept for further analysis.

I was wondering if you could advise me how to keep those 500 files?

Many thanks
PROC Star
Posts: 8,151

Re: Top 10 SAS Functions

The following (not tested though) keeps all 500 files, as well as all 500 models and training datasets:

 

data kyphosis;
  infile cards dlm='09'x;
  input y1	kyphosis;
  cards;
28	0
15.5	0
8.2	0
3.4	0
17.3	0
15.2	0
32.9	0
11.1	0
87.5	0
16.2	0
107.9	0
5.7	0
25.6	0
31.2	0
21.6	0
55.6	0
8.8	0
6.5	0
22.1	0
14.4	0
44.2	0
3.7	0
7.8	0
8.9	0
18	0
6.5	0
4.9	0
10.4	0
5	0
5.3	0
6.5	0
6.9	0
8.2	0
21.8	0
6.6	0
7.6	0
15.4	0
59.2	0
5.1	0
10	0
5.3	0
32.6	0
4.6	0
6.9	0
4	0
3.65	0
7.8	0
32.5	0
11.5	0
4	0
10.2	0
2.4	1
719	1
2106.667	1
24000	1
1715	1
3.6	1
521.5	1
1600	1
454	1
109.7	1
23.7	1
464	1
9810	1
255	1
58.7	1
225	1
90.1	1
50	1
5.6	1
4070	1
592	1
28.6	1
6160	1
1090	1
10.4	1
27.3	1
162	1
3560	1
14.7	1
83.3	1
336	1
55.7	1
1520	1
3.9	1
5.8	1
8.45	1
361	1
369	1
8230	1
39.3	1
43.5	1
361	1
12.8	1
18	1
9590	1
555	1
60.2	1
21.8	1
900	1
6.6	1
239	1
3100	1
3275	1
682	1
85.4	1
10290	1
770	1
247.6	1
12320	1
113.1	1
1079	1
45.6	1
1630	1
79.4	1
508	1
3190	1
542	1
1021	1
235	1
251	1
3160	1
479	1
222	1
15.7	1
2540	1
11630	1
1810	1
6.9	1
4.1	1
15.6	1
9820	1
1490	1
15.7	1
45.8	1
7.8	1
12.8	1
100.5333	1
227	1
70.9	1
2500	1
;

/*STEP 2: Repeat the following 50*10 times:
i. Fit a logistic regression model on 9/10 of your data (the training dataset) and hold aside the other 1/10 as the test dataset.
ii. Use the fitted model to calculate the predicted probability of kyphosis=1 for each observation in the test dataset.
iii. Store these predicted probabilities in a new dataset, "predicted".*/


/*The MACRO*/
%macro runit;
  %let counter=0; /* incremented before first use, so the datasets are numbered 1 to 500 */
  %do i=1 %to 50;
    data kyphosis;
      set kyphosis;
      theRandom = ranuni(0);
    run;

/*Then, divide the dataset into 10 groups based on the random number.*/
/*Open the dataset kRanked to verify that each observation is ranked 0 to 9 (10 even groups).*/

proc rank data=kyphosis out = kRanked groups = 10;
var theRandom;
run;
  /*PS: Since we will later be appending observations onto the dataset ‘predicted’, we want to make
  sure that there is not already a dataset called predicted.*/

  proc datasets library = work nodetails nolist;
    delete predicted&i.;
  run;
  quit;

%do x = 0 %to 9; * asks SAS to repeat the steps 10 times;

%let counter=%eval(&counter+1);

proc logistic data = kRanked outmodel = model&counter.; *Fit the logistic model on 9/10 of the data, and
output the model into a separately numbered dataset (model plus the running counter) so no fold overwrites another;
model kyphosis (event="1") = y1;
where theRandom ne &x; * Omit 1/10 of the data (eg, when x=0, omit the observations where theRandom=0).;
run;


data training&counter.;
set kRanked;
where theRandom ne &x;
run;

data test&counter.; * Put the omitted data into the test dataset, numbered by the running counter;
set kRanked;
where theRandom = &x;
run;


proc logistic inmodel = model&counter.; *Apply the logistic model to the test dataset and put
the predicted probabilities into a dataset numbered by the same counter;

score data= test&counter. out = _predicted&counter.;
run;
proc append base = predicted&i. data = _predicted&counter.; *Keep adding the predicted values to a single dataset ‘predicted.’ ;
run;
%end;
%end;
%mend runit;

%runit;

Art, CEO, AnalystFinder.com

 

 

 

Contributor
Posts: 27

Re: Top 10 SAS Functions

Many thanks. It does indeed give me what I wanted.

Your help is much appreciated.
Contributor
Posts: 27

Re: Top 10 SAS Functions

Hi,

I need to calculate sensitivity and specificity using each value of y1 (2.4 to 24000) as a cut-off, append the results to a dataset as shown below, and repeat the process for every y1 in the dataset.

I was wondering if you could help update my code below to produce the results I need.

 

 

DATA kyphosis;
  set kyphosis;
  ** Create a binary variable;
  if y1 <= 39 then y11=0;
  else y11=1;
run;

* Calculate Sensitivity and Specificity with 39.2 as a cut;

proc freq data = kyphosis order = formatted;
  tables kyphosis * y11 / nocol nopercent;
run;

 

Here are the sensitivity and specificity results for y1 = 39.2:

The FREQ Procedure

Table of kyphosis by y11 (Frequency, Row Pct)

kyphosis   y11 = 0        y11 = 1        Total
0          46 (90.20)     5 (9.80)       51
1          22 (24.44)     68 (75.56)     90
Total      68             73             141

 

Append the results as follows:

Y1     Prevalence   Sensitivity   Specificity   PPV (Positive Predictive Value)   NPV (Negative Predictive Value), adjusted
39.2   0.07         75.56         90.20         9.80                              24.44
2.4    0.07

I need to repeat this process for each value of y1 (2.4 to 24000) and append the results to a dataset.

I was wondering if you could help update my code to do this.

 

Thanks,

Contributor
Posts: 27

Re: Top 10 SAS Functions

Hi art297,

Using the same kyphosis dataset as above:

I need to calculate sensitivity and specificity using each value of y1 (2.4 to 24000) as a cut-off, append the results to a dataset as shown below, and repeat the process for every y1 in the dataset.

I was wondering if you could help update my code below to produce the results I need.

Here is the code I have used:



DATA kyphosis;
  set kyphosis;
  ** Create a binary variable;
  if y1 <= 39 then y11=0;
  else y11=1;
run;

* Calculate Sensitivity and Specificity with 39.2 as a cut;

proc freq data = kyphosis order = formatted;
  tables kyphosis * y11 / nocol nopercent;
run;



Here are the sensitivity and specificity results for y1 = 39.2:

The FREQ Procedure

Table of kyphosis by y11 (Frequency, Row Pct)

kyphosis   y11 = 0        y11 = 1        Total
0          46 (90.20)     5 (9.80)       51
1          22 (24.44)     68 (75.56)     90
Total      68             73             141



Append the results as follows:



Y1 = 39.2

Sensitivity = 75.56

Specificity= 90.20

PPV ( Positive Predictive Value) = 9.80

NPV = 24.44

I need to repeat this process for each value of y1 (2.4 to 24000) and append the results to a dataset.

I was wondering if you could help update my code to do this.



Thanks,
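
One possible way to do this, as a minimal untested sketch rather than a definitive answer: pair every distinct y1 value (used as a cut-off) with every observation in PROC SQL and count per cut-off. It assumes WORK.KYPHOSIS holds numeric y1 and kyphosis coded 0/1, and that a result counts as test positive when y1 is greater than the cut-off, as in the y11 rule above. The names cutoffs, cut_stats, prev, ppv and npv are made up for illustration, and the prevalence of 0.07 is simply taken from the example table. Note that the 9.80 and 24.44 shown in the example row are 100 minus specificity and 100 minus sensitivity, not predictive values; the sketch instead computes prevalence-adjusted PPV and NPV with Bayes' rule.

```sas
/* Sketch only: sensitivity and specificity at every observed y1 cut-off.
   A result is treated as "test positive" when y1 > cutoff. */
proc sql;
  create table cutoffs as
    select distinct y1 as cutoff
    from kyphosis;

  /* Cartesian join: every cut-off against every observation.
     In PROC SQL a true comparison evaluates to 1, so SUM() counts rows. */
  create table cut_stats as
    select c.cutoff,
           100 * sum(k.kyphosis = 1 and k.y1 >  c.cutoff) / sum(k.kyphosis = 1)
             as sensitivity format=6.2,
           100 * sum(k.kyphosis = 0 and k.y1 <= c.cutoff) / sum(k.kyphosis = 0)
             as specificity format=6.2
    from cutoffs as c, kyphosis as k
    group by c.cutoff
    order by c.cutoff;
quit;

/* Prevalence-adjusted predictive values via Bayes' rule.
   prev = 0.07 comes from the example table and is an assumption here. */
data cut_stats;
  set cut_stats;
  prev = 0.07;
  se = sensitivity / 100;
  sp = specificity / 100;
  /* DIVIDE() yields a missing value rather than a division-by-zero
     error when the denominator is 0 (e.g. at the largest cut-off) */
  ppv = 100 * divide(se * prev, se * prev + (1 - sp) * (1 - prev));
  npv = 100 * divide(sp * (1 - prev), sp * (1 - prev) + (1 - se) * prev);
  drop se sp;
run;
```

The result is one row per cut-off, which avoids looping over PROC FREQ and appending. The log will carry a note about the Cartesian product; with 141 observations that join is small.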
Trusted Advisor
Posts: 1,312

Re: Automating cross validation

I don't know about the rest of the program, but I would skip the proc rank and use the rand('table',...) function to randomly assign the groups (it numbers them 1 to 10 rather than 0 to 9). Here's how:

 

data kyphosis (drop=_:);
  set kyphosis nobs=nrecs;
  array needed {10} _temporary_;
  retain _nremain;
  if _n_=1 then do;
    _nremain=nrecs;
    do _col=1 to 10; needed{_col}=ceil(nrecs/10); end;
  end;

  call streaminit(01982066);
  array prb{10} _temporary_ ;
  do _col=1 to 10;  prb{_col}=needed{_col}/_nremain; end;

  rnd=rand('table',of prb{*});
  needed{rnd}=needed{rnd}-1;
  _nremain=_nremain-1;
run;

 

Moreover, you can do it for 500 variables at once:

data kyphosis (drop=_:);
  set kyphosis nobs=nrecs;
  array needed {500,10} _temporary_;
  retain _nremain;
  if _n_=1 then do;
    _nremain=nrecs;
    do _row=1 to 500;
      do _col=1 to 10; needed{_row,_col}=ceil(nrecs/10); end;
    end;
  end;
  call streaminit(01982066);
  array _prb{10} _temporary_;
  array rnd{500};
  do _row=1 to 500;
    do _col=1 to 10;  _prb{_col}=min(1,needed{_row,_col}/_nremain); end;
    rnd{_row}=rand('table',of _prb{*});
    needed{_row,rnd{_row}}=needed{_row,rnd{_row}}-1;
  end;
  _nremain=_nremain-1;
run;

 

This will eliminate 499 data steps and 500 proc ranks at the beginning of your script.
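
As a rough, untested sketch of how the rnd1-rnd500 columns could then drive the cross-validation loop: the macro below selects training and test rows with WHERE= dataset options instead of creating separate training and test datasets. The macro name, its parameters, and the datasets predicted_all, _model and _scored are hypothetical, and stacking everything into one dataset with rep and fold columns is an assumption about what is convenient, not something requested above.

```sas
/* Sketch only: run the folds from the rnd1-rnd500 columns created above. */
%macro cv_from_columns(reps=500, folds=10);
  /* Start from an empty result set, as in the original macro */
  proc datasets library=work nodetails nolist;
    delete predicted_all;
  run;
  quit;

  %do i = 1 %to &reps;
    %do x = 1 %to &folds;   /* rand('table') numbers the groups 1 to 10 */
      /* Fit on the nine folds that are not group &x of replicate &i */
      proc logistic data=kyphosis(where=(rnd&i ne &x)) outmodel=_model noprint;
        model kyphosis(event="1") = y1;
      run;

      /* Score the held-out fold with the stored model */
      proc logistic inmodel=_model;
        score data=kyphosis(where=(rnd&i = &x)) out=_scored;
      run;

      /* Tag each scored row with its replicate and fold */
      data _scored;
        set _scored;
        rep  = &i;
        fold = &x;
      run;

      proc append base=predicted_all data=_scored;
      run;
    %end;
  %end;
%mend cv_from_columns;

%cv_from_columns(reps=500, folds=10);
```

If separate files per replicate and fold are needed, writing the SCORE output to a name such as predicted_&i._&x and skipping the append would keep them apart.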

Contributor
Posts: 27

Re: Automating cross validation

Many thanks  

 

 

 

 

 
