Re: Top 10 SAS Functions

jeka1212 · Posted 02-21-2018 05:10 PM

Hi all,

The macro below performs 10-fold cross-validation. It computes the predicted probabilities for each test dataset it generates and appends the results to the dataset predicted.

I need to repeat the same process 500 times (currently the process runs once). I was wondering if you could advise how I can modify the macro to repeat it.

Note: the data and the code are below:


data kyphosis;
set kyphosis;
theRandom = ranuni(86);
run;

*Then, divide the dataset into 10 groups based on the random number;
/*Open the dataset kRanked to verify that each observation is ranked 0 to 9 (10 even groups).*/;


proc rank data=kyphosis out = kRanked groups = 10;
var theRandom;
run;

/*STEP 2: Repeat the following 10 times:
i. Fit a logistic regression model on 9/10 of your data (the training dataset) and hold aside the other 1/10 as the test dataset.
ii. Use the fitted model to calculate the predicted probability of kyphosis=1 for each observation in the test dataset.
iii. Store these predicted probabilities in a new dataset, "predicted".*/

/*PS: Since we will later be appending observations onto the dataset ‘predicted’, we want to make
sure that there is not already a dataset called predicted.*/

proc datasets library = work nodetails nolist;
delete predicted;
run;
quit;

/*The MACRO*/

%macro runit;
%do x = 0 %to 9; * asks SAS to repeat the steps 10 times (groups 0 to 9);
proc logistic data = kRanked outmodel = model&x.; *Fit the logistic model on 9/10 of the data, and
output the model into the dataset model0 (when x=0), model1 (when x=1), etc.;
model kyphosis (event="1") = y1;
where theRandom ne &x; * Omit 1/10 of the data (eg, when x=0, omit the observations where theRandom=0).;
run;
data training&x.;
set kRanked;
where theRandom ne &x;
run;
data test&x.; * Put the omitted data into the test dataset, called test0 (when x=0), test1 (when x=1), etc.;
set kRanked;
where theRandom = &x;
run;

proc logistic inmodel = model&x.; *Apply the logistic model to the test dataset and put
the predicted probabilities into a dataset predicted0, predicted1, etc.;
score data= test&x. out = predicted&x.;
run;
proc append base = predicted data = predicted&x.; *Keep adding the predicted values to a single dataset ‘predicted.’ ;
run;
%end;
%mend runit;
%runit;

My dataset, called kyphosis, is below:

y1	kyphosis
28	0
15.5	0
8.2	0
3.4	0
17.3	0
15.2	0
32.9	0
11.1	0
87.5	0
16.2	0
107.9	0
5.7	0
25.6	0
31.2	0
21.6	0
55.6	0
8.8	0
6.5	0
22.1	0
14.4	0
44.2	0
3.7	0
7.8	0
8.9	0
18	0
6.5	0
4.9	0
10.4	0
5	0
5.3	0
6.5	0
6.9	0
8.2	0
21.8	0
6.6	0
7.6	0
15.4	0
59.2	0
5.1	0
10	0
5.3	0
32.6	0
4.6	0
6.9	0
4	0
3.65	0
7.8	0
32.5	0
11.5	0
4	0
10.2	0
2.4	1
719	1
2106.667	1
24000	1
1715	1
3.6	1
521.5	1
1600	1
454	1
109.7	1
23.7	1
464	1
9810	1
255	1
58.7	1
225	1
90.1	1
50	1
5.6	1
4070	1
592	1
28.6	1
6160	1
1090	1
10.4	1
27.3	1
162	1
3560	1
14.7	1
83.3	1
336	1
55.7	1
1520	1
3.9	1
5.8	1
8.45	1
361	1
369	1
8230	1
39.3	1
43.5	1
361	1
12.8	1
18	1
9590	1
555	1
60.2	1
21.8	1
900	1
6.6	1
239	1
3100	1
3275	1
682	1
85.4	1
10290	1
770	1
247.6	1
12320	1
113.1	1
1079	1
45.6	1
1630	1
79.4	1
508	1
3190	1
542	1
1021	1
235	1
251	1
3160	1
479	1
222	1
15.7	1
2540	1
11630	1
1810	1
6.9	1
4.1	1
15.6	1
9820	1
1490	1
15.7	1
45.8	1
7.8	1
12.8	1
100.5333	1
227	1
70.9	1
2500	1

art297 · Posted 02-21-2018 06:43 PM

The title of your question is totally misleading.

Which process do you want to repeat 500 times? Run the macro 500 times or do the entire process 500 times?

Since you initially select a random sequence based on a specific seed, do you want to do that same thing 500 times, or use a different seed each time?

I presume that you want to generate 500 different output files, but you haven't specified what you want.

Art, CEO, AnalystFinder.com

jeka1212 · Posted 02-22-2018 03:14 PM

Sorry for not being clear.

Yes - I want to generate 500 different output files.

So to do that I need to run the entire process 50 times, i.e., each run generates 10 different output files.

art297 · Posted 02-22-2018 04:02 PM

You could just add an outer loop within your current macro. However, if you use the same seed (as you did), all 50 replications will be identical. So I moved the random selection into the macro and used a seed of 0, which makes RANUNI take its seed from the system clock.

Also, I presume you only want 50 files and that the ten files created each time can be overwritten.

Check to see if the following does what you want:

data kyphosis;
  infile cards dlm='09'x;
  input y1	kyphosis;
  cards;
28	0
15.5	0
8.2	0
3.4	0
17.3	0
15.2	0
32.9	0
11.1	0
87.5	0
16.2	0
107.9	0
5.7	0
25.6	0
31.2	0
21.6	0
55.6	0
8.8	0
6.5	0
22.1	0
14.4	0
44.2	0
3.7	0
7.8	0
8.9	0
18	0
6.5	0
4.9	0
10.4	0
5	0
5.3	0
6.5	0
6.9	0
8.2	0
21.8	0
6.6	0
7.6	0
15.4	0
59.2	0
5.1	0
10	0
5.3	0
32.6	0
4.6	0
6.9	0
4	0
3.65	0
7.8	0
32.5	0
11.5	0
4	0
10.2	0
2.4	1
719	1
2106.667	1
24000	1
1715	1
3.6	1
521.5	1
1600	1
454	1
109.7	1
23.7	1
464	1
9810	1
255	1
58.7	1
225	1
90.1	1
50	1
5.6	1
4070	1
592	1
28.6	1
6160	1
1090	1
10.4	1
27.3	1
162	1
3560	1
14.7	1
83.3	1
336	1
55.7	1
1520	1
3.9	1
5.8	1
8.45	1
361	1
369	1
8230	1
39.3	1
43.5	1
361	1
12.8	1
18	1
9590	1
555	1
60.2	1
21.8	1
900	1
6.6	1
239	1
3100	1
3275	1
682	1
85.4	1
10290	1
770	1
247.6	1
12320	1
113.1	1
1079	1
45.6	1
1630	1
79.4	1
508	1
3190	1
542	1
1021	1
235	1
251	1
3160	1
479	1
222	1
15.7	1
2540	1
11630	1
1810	1
6.9	1
4.1	1
15.6	1
9820	1
1490	1
15.7	1
45.8	1
7.8	1
12.8	1
100.5333	1
227	1
70.9	1
2500	1
;

/*STEP 2: Repeat the following 50*10 times:
i. Fit a logistic regression model on 9/10 of your data (the training dataset) and hold aside the other 1/10 as the test dataset.
ii. Use the fitted model to calculate the predicted probability of kyphosis=1 for each observation in the test dataset.
iii. Store these predicted probabilities in a new dataset, "predicted".*/


/*The MACRO*/
%macro runit;
  %do i=1 %to 50;
    data kyphosis;
      set kyphosis;
      theRandom = ranuni(0);
    run;

/*Then, divide the dataset into 10 groups based on the random number.*/
/*Open the dataset kRanked to verify that each observation is ranked 0 to 9 (10 even groups).*/

proc rank data=kyphosis out = kRanked groups = 10;
var theRandom;
run;
  /*PS: Since we will later be appending observations onto the dataset ‘predicted’, we want to make
  sure that there is not already a dataset called predicted.*/

  proc datasets library = work nodetails nolist;
    delete predicted&i.;
  run;
  quit;

%do x = 0 %to 9; * asks SAS to repeat the steps 10 times;

proc logistic data = kRanked outmodel = model&x.; *Fit the logistic model on 9/10 of the data, and
output the model into the dataset model0 (when x=0), model1 (when x=1), etc.;
model kyphosis (event="1") = y1;
where theRandom ne &x; * Omit 1/10 of the data (eg, when x=0, omit the observations where theRandom=0).;
run;

data training&x.;
set kRanked;
where theRandom ne &x;
run;

data test&x.; * Put the omitted data into the test dataset, called test0 (when x=0), test1 (when x=1), etc.;
set kRanked;
where theRandom = &x;
run;


proc logistic inmodel = model&x.; *Apply the logistic model to the test dataset and put
the predicted probabilities into a dataset predicted0, predicted1, etc.;

score data= test&x. out = _predicted&x.;
run;
proc append base = predicted&i. data = _predicted&x.; *Keep adding the predicted values to a single dataset ‘predicted.’ ;
run;
%end;
%end;
%mend runit;

%runit;
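
For later analysis it may be easier to work with one stacked dataset than with 50 separate ones. The following is only a sketch, assuming the macro above has run and left predicted1 to predicted50 in WORK; the names all_predicted, source and replication are made up for illustration. The fold number is already available in each file as theRandom, because the scored test data carries it along.

```sas
/* Stack the 50 per-replication files and record which file each row came from. */
data all_predicted;
  /* predicted1-predicted50 is a numbered range list of datasets;            */
  /* INDSNAME= stores the name of the contributing dataset, e.g. WORK.PREDICTED12 */
  set predicted1-predicted50 indsname = source;
  /* keep only the digits of the member name to get the replication number */
  replication = input(compress(scan(source, 2, '.'), , 'kd'), best.);
run;

/* Quick check: each replication should contain every observation exactly once */
proc freq data = all_predicted;
  tables replication / nocum nopercent;
run;
```

With the replication number stored as a column, BY-group processing (after sorting by replication) can replace a further macro loop in most follow-up steps.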


jeka1212 · Posted 02-22-2018 11:02 PM

Many thanks. I really appreciate your help.

In my case I want to keep all 500 files: the ten files created in each replication need to be kept for further analysis.

I was wondering if you could advise me how to keep those 500 files?

Many thanks

art297 · Posted 02-22-2018 11:29 PM

The following (not tested though) keeps all 500 files, as well as all 500 models and training datasets:

data kyphosis;
  infile cards dlm='09'x;
  input y1	kyphosis;
  cards;
28	0
15.5	0
8.2	0
3.4	0
17.3	0
15.2	0
32.9	0
11.1	0
87.5	0
16.2	0
107.9	0
5.7	0
25.6	0
31.2	0
21.6	0
55.6	0
8.8	0
6.5	0
22.1	0
14.4	0
44.2	0
3.7	0
7.8	0
8.9	0
18	0
6.5	0
4.9	0
10.4	0
5	0
5.3	0
6.5	0
6.9	0
8.2	0
21.8	0
6.6	0
7.6	0
15.4	0
59.2	0
5.1	0
10	0
5.3	0
32.6	0
4.6	0
6.9	0
4	0
3.65	0
7.8	0
32.5	0
11.5	0
4	0
10.2	0
2.4	1
719	1
2106.667	1
24000	1
1715	1
3.6	1
521.5	1
1600	1
454	1
109.7	1
23.7	1
464	1
9810	1
255	1
58.7	1
225	1
90.1	1
50	1
5.6	1
4070	1
592	1
28.6	1
6160	1
1090	1
10.4	1
27.3	1
162	1
3560	1
14.7	1
83.3	1
336	1
55.7	1
1520	1
3.9	1
5.8	1
8.45	1
361	1
369	1
8230	1
39.3	1
43.5	1
361	1
12.8	1
18	1
9590	1
555	1
60.2	1
21.8	1
900	1
6.6	1
239	1
3100	1
3275	1
682	1
85.4	1
10290	1
770	1
247.6	1
12320	1
113.1	1
1079	1
45.6	1
1630	1
79.4	1
508	1
3190	1
542	1
1021	1
235	1
251	1
3160	1
479	1
222	1
15.7	1
2540	1
11630	1
1810	1
6.9	1
4.1	1
15.6	1
9820	1
1490	1
15.7	1
45.8	1
7.8	1
12.8	1
100.5333	1
227	1
70.9	1
2500	1
;

/*STEP 2: Repeat the following 50*10 times:
i. Fit a logistic regression model on 9/10 of your data (the training dataset) and hold aside the other 1/10 as the test dataset.
ii. Use the fitted model to calculate the predicted probability of kyphosis=1 for each observation in the test dataset.
iii. Store these predicted probabilities in a new dataset, "predicted".*/


/*The MACRO*/
%macro runit;
  %let counter=0; /* running file number: 1 to 500 across all replications and folds */
  %do i=1 %to 50;
    data kyphosis;
      set kyphosis;
      theRandom = ranuni(0);
    run;

/*Then, divide the dataset into 10 groups based on the random number.*/
/*Open the dataset kRanked to verify that each observation is ranked 0 to 9 (10 even groups).*/

proc rank data=kyphosis out = kRanked groups = 10;
var theRandom;
run;
  /*PS: Since we will later be appending observations onto the dataset ‘predicted’, we want to make
  sure that there is not already a dataset called predicted.*/

  proc datasets library = work nodetails nolist;
    delete predicted&i.;
  run;
  quit;

%do x = 0 %to 9; * asks SAS to repeat the steps 10 times;

%let counter=%eval(&counter+1);

proc logistic data = kRanked outmodel = model&counter.; *Fit the logistic model on 9/10 of the data, and
output the model into the dataset model1, model2, etc., up to model500;
model kyphosis (event="1") = y1;
where theRandom ne &x; * Omit 1/10 of the data (eg, when x=0, omit the observations where theRandom=0).;
run;


data training&counter.;
set kRanked;
where theRandom ne &x;
run;

data test&counter.; * Put the omitted data into the test dataset, called test1, test2, etc.;
set kRanked;
where theRandom = &x;
run;


proc logistic inmodel = model&counter.; *Apply the logistic model to the test dataset and put
the predicted probabilities into a dataset _predicted1, _predicted2, etc.;

score data= test&counter. out = _predicted&counter.;
run;
proc append base = predicted&i. data = _predicted&counter.; *Keep adding the predicted values to the dataset for this replication (predicted1 to predicted50).;
run;
%end;
%end;
%mend runit;

%runit;
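
One side effect of ranuni(0) is that the clock-based seed makes the 50 random splits impossible to reproduce later. If that matters, a possible variation is to derive the seed from the replication number, so the replications differ from each other but come out the same on every run. This is only a sketch: the macro name assign_folds and its parameter rep are hypothetical, and using the replication number as the seed is an assumption, not something from the posts above.

```sas
%macro assign_folds(rep=1);
  data kyphosis;
    set kyphosis;
    /* assumption: seed the random stream with the replication number */
    call streaminit(&rep.);
    theRandom = rand('uniform');
  run;

  /* same ranking step as in the thread: theRandom becomes the fold number 0 to 9 */
  proc rank data = kyphosis out = kRanked groups = 10;
    var theRandom;
  run;
%mend assign_folds;

/* example call for the first replication */
%assign_folds(rep=1);
```

Inside the outer loop of the macro above, the call would be %assign_folds(rep=&i.) in place of the data step and PROC RANK.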


jeka1212 · Posted 02-23-2018 12:45 PM

Many thanks. It does indeed give me what I wanted.

I much appreciate your help.

jeka1212 · Posted 02-28-2018 09:55 AM

Hi art297 and all,

I am using the same dataset as above (kyphosis).

I need to calculate sensitivity and specificity for each value of y1 (2.4 to 24000), append the results to a dataset as shown below, and repeat the process for each y1 in the dataset.

I was wondering if you could help me update my code below to get the results I need.

Here is the code I have used:

DATA kyphosis;
set kyphosis;
** Create a binary variable;
if y1 <= 39.2 then y11=0;
else y11=1;
run;

* Calculate Sensitivity and Specificity with 39.2 as a cut;
proc freq data = kyphosis order = formatted;
tables kyphosis * y11 / nocol nopercent;
run;

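Here are the results for sensitivity and specificity at Y1 = 39.2:

The FREQ Procedure

Table of kyphosis by y11 (rows: kyphosis, columns: y11)

Frequency |
Row Pct   |       0 |       1 |  Total
----------+---------+---------+-------
0         |      46 |       5 |     51
          |   90.20 |    9.80 |
----------+---------+---------+-------
1         |      22 |      68 |     90
          |   24.44 |   75.56 |
----------+---------+---------+-------
Total     |      68 |      73 |    141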

Append the results as follows:

Y1	Prevalence	Sensitivity	Specificity	PPV (Positive Predictive Value)	NPV (Negative Predictive Value) adjusted
39.2	0.07	75.56	90.20	9.80	24.44
2.4	0.07

I need to repeat this process for each value of y1 (2.4 to 24000) and append the results to a dataset.

I was wondering if you could help me update my code to do this.

Thanks,

jeka1212 · Posted 02-28-2018 03:59 PM

Hi art297,

I am using the same dataset as above (kyphosis).

I need to calculate sensitivity and specificity for each value of y1 (2.4 to 24000), append the results to a dataset as shown below, and repeat the process for each y1 in the dataset.

I was wondering if you could help me update my code below to get the results I need.

Here is the code I have used:

DATA kyphosis;
set kyphosis;
** Create a binary variable;
if y1 <= 39.2 then y11=0;
else y11=1;
run;

* Calculate Sensitivity and Specificity with 39.2 as a cut;
proc freq data = kyphosis order = formatted;
tables kyphosis * y11 / nocol nopercent;
run;

Here are the results for sensitivity and specificity at Y1 = 39.2:

The FREQ Procedure

Table of kyphosis by y11 (rows: kyphosis, columns: y11)

Frequency |
Row Pct   |       0 |       1 |  Total
----------+---------+---------+-------
0         |      46 |       5 |     51
          |   90.20 |    9.80 |
----------+---------+---------+-------
1         |      22 |      68 |     90
          |   24.44 |   75.56 |
----------+---------+---------+-------
Total     |      68 |      73 |    141

Append the results as follows:

Y1 = 39.2

Sensitivity = 75.56

Specificity = 90.20

PPV (Positive Predictive Value) = 9.80

NPV (Negative Predictive Value) = 24.44

I need to repeat this process for each value of y1 (2.4 to 24000) and append the results to a dataset.

I was wondering if you could help me update my code to do this.

Thanks,

mkeintz · Posted 02-28-2018 05:54 PM

I don't know about the rest of the program, but I would skip the proc rank and use the rand('table',...) function to randomly assign the ten groups. Note that rand('table',...) numbers the groups 1 to 10 rather than 0 to 9. Here's how:

data kyphosis (drop=_:);
  set kyphosis nobs=nrecs;
  array needed {10} _temporary_;
  retain _nremain;
  if _n_=1 then do;
    _nremain=nrecs;
    do _col=1 to 10; needed{_col}=ceil(nrecs/10); end;
  end;

  call streaminit(01982066);
  array prb{10} _temporary_ ;
  do _col=1 to 10;  prb{_col}=needed{_col}/_nremain; end;

  rnd=rand('table',of prb{*});
  needed{rnd}=needed{rnd}-1;
  _nremain=_nremain-1;
run;

Moreover, you can create all 500 group-assignment variables (rnd1 to rnd500) at once:

data kyphosis (drop=_:);
  set kyphosis nobs=nrecs;
  array needed {500,10} _temporary_;
  retain _nremain;
  if _n_=1 then do;
    _nremain=nrecs;
    do _row=1 to 500;
      do _col=1 to 10; needed{_row,_col}=ceil(nrecs/10); end;
    end;
  end;
  call streaminit(01982066);
  array _prb{10} _temporary_;
  array rnd{500};
  do _row=1 to 500;
    do _col=1 to 10;  _prb{_col}=min(1,needed{_row,_col}/_nremain); end;
    rnd{_row}=rand('table',of _prb{*});
    needed{_row,rnd{_row}}=needed{_row,rnd{_row}}-1;
  end;
  _nremain=_nremain-1;
run;

This will eliminate 499 data steps and 500 proc ranks at the beginning of your script.
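
To connect this to the cross-validation macro earlier in the thread, the inner loop can filter on rnd1 to rnd500 directly instead of on theRandom. The following is a rough, untested sketch that assumes the 500-variable data step above has been run; the macro name runit_rnd, the parameter reps, and the scratch datasets _model and _pred are made up for illustration. Because rand('table',...) returns values from 1 to 10, the fold loop runs from 1 to 10 rather than 0 to 9.

```sas
%macro runit_rnd(reps=50);
  %do i = 1 %to &reps.;
    /* make sure the per-replication output does not already exist */
    proc datasets library = work nodetails nolist;
      delete predicted&i.;
    run;
    quit;

    %do x = 1 %to 10;
      /* fit on the nine groups that are not group &x of replication &i */
      proc logistic data = kyphosis outmodel = _model noprint;
        model kyphosis (event="1") = y1;
        where rnd&i. ne &x.;
      run;

      /* score the held-out group with the stored model */
      proc logistic inmodel = _model;
        score data = kyphosis (where = (rnd&i. = &x.) keep = y1 kyphosis rnd&i.)
              out = _pred;
      run;

      /* collect the ten folds of this replication in predicted&i */
      proc append base = predicted&i. data = _pred;
      run;
    %end;
  %end;
%mend runit_rnd;

%runit_rnd(reps=50);
```

This version overwrites the model and the scored fold on every pass and keeps only the per-replication predictions; if every model and test set must be kept, the scratch names would need a suffix such as &i._&x.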


jeka1212 · Posted 03-01-2018 09:08 AM

Many thanks mkeintz,

My query was about using the existing kyphosis dataset above to calculate sensitivity and specificity and append the results. The attached file illustrates much better what I am trying to achieve.

Any help?

Thanks
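
One possible way to get these measures for every cutoff without looping PROC FREQ is to count the four cells of the 2x2 table for all cutoffs in a single PROC SQL step. This is a hedged sketch, not a tested solution: it assumes a test is called positive when y1 is above the cutoff (matching the y11 coding in the post), it uses every distinct y1 value as a cutoff, and the dataset names cutoffs, counts and accuracy are made up. It also computes PPV and NPV from the columns of the table, so for the table posted earlier they would be 68/73 and 46/68 (roughly 93% and 68%); the 9.80 and 24.44 quoted there are row percentages, i.e. 100 minus specificity and 100 minus sensitivity. No prevalence adjustment is applied.

```sas
proc sql;
  /* every distinct y1 value is a candidate cutoff */
  create table cutoffs as
    select distinct y1 as cutoff
    from kyphosis;

  /* pair each cutoff with every observation and count the 2x2 cells */
  create table counts as
    select c.cutoff,
           sum(k.kyphosis = 1 and k.y1 >  c.cutoff) as tp,
           sum(k.kyphosis = 1 and k.y1 <= c.cutoff) as fn,
           sum(k.kyphosis = 0 and k.y1 >  c.cutoff) as fp,
           sum(k.kyphosis = 0 and k.y1 <= c.cutoff) as tn
    from cutoffs as c, kyphosis as k
    group by c.cutoff
    order by c.cutoff;
quit;

data accuracy;
  set counts;
  /* percentages are left missing when the denominator is zero */
  if tp + fn > 0 then sensitivity = 100 * tp / (tp + fn);
  if tn + fp > 0 then specificity = 100 * tn / (tn + fp);
  if tp + fp > 0 then ppv = 100 * tp / (tp + fp);
  if tn + fn > 0 then npv = 100 * tn / (tn + fn);
run;
```

The result is one row per cutoff, which is the appended layout described in the post. The cross join is small here (141 observations by at most 141 cutoffs), so SAS's note about a Cartesian product can be ignored for data of this size.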
