Hi there,
I am writing a code to perform k-fold CV using proc iml in SAS studio.
My data is stored in a 47x15 matrix, and each column here represents one Principal Component (i.e. column one is PC1 and column 15 is PC15 and I have 47 rows/obs). I want to apply linear regression and get the cross-validated MSE for 15 different models, i.e., I start with the model with only PC1 and I add one more PC at each iteration. For each model, I perform 5-fold CV to get the cross-validated MSE. In R, I could just use cv.lm, but since I could not find a function to perform CV inside SAS IML, I decided to code it myself.
THE PROBLEM IS: I created a vector of 47 obs using sample without replacement (i.e., the 47 numbers appear in random order). From this vector I was able to create a test set index and a training set index. I then apply this index to the model matrix X. When X has only one predictor (i.e. PC1), this works just fine and I am able to create the randomly selected test and train sets. However, when X has more than one predictor (e.g. PC1 PC2), SAS returns an ERROR message: (execution) Invalid subscript or subscript out of range.
I spent literally hours today trying to figure this out. I included some prints in my code to help the debug, but it seems very weird to me that this is not working for more than one var.
Can someone please help?
Here's the piece of code which is problematic ( starts in trainset=x[train_ind,]; )... again it's working for i=2 (one predictor) but not for i>2 (more than one predictor).
k=5;							 /* number of cv folds*/
do i = 2 to 16;
	mse_acum=0;
	x = pcs_matrix[,2:i];
	y = pcs_matrix[,1];
	n = nrow(x);                     /* sample size         */
	aux= 1:n;						 /* vector from 1:n     */
	call randseed(1);		     /* reproducible results*/
	s = sample(aux, n, "WOR"); /* sampling without replacement */
	do j = 1 to (k-1); /* creating train and test sets and regressing */
		test_ind = s[((j-1)#ceil(n/k) + 1):(j#ceil(n/k))];
		print test_ind;
		train_ind = setdif(s,test_ind);
		print train_ind;
		trainset=x[train_ind,];
		print trainset;
		testset =x[test_ind,];
		print testset;
		testset_y = y[test_ind];
		trainset_y = y[train_ind];
		run Regress_cv;
		mse_acum=mse_acum + mse;
	end;
	/* for the last fold since n it's not always divisible by k do:*/
	test_ind = s[((k-1)#ceil(n/k) + 1):n];
	train_ind = setdif(s,test_ind);
	testset =x[test_ind,];
	trainset=x[train_ind,];
	testset_y = y[test_ind];
	trainset_y = y[train_ind];
	run Regress_cv;
	mse_acum=mse_acum + mse;
	print mse_acum;
	r2_cross[i-1] = 1 - mse_acum#n/ssquares; 
end;I appreciate your help!
Nice code. For someone who started using SAS a few weeks ago, you have made great progress.
Before I discuss the error, let me show you how the SAS log helps you find it. When you run the program, the log shows something like this:
747 train_ind = setdif(s,test_ind);
748 print train_ind;
749 trainset=x[train_ind, ];
750 print trainset;
751 testset =x[test_ind, ];
752 print testset;
753 testset_y = y[test_ind];
754 trainset_y = y[train_ind];
755 run Regress_cv;
756 mse_acum=mse_acum + mse;
757 end;
...
769 end;
ERROR: (execution) Invalid subscript or subscript out of range.
operation : [ at line 749 column 17
 operands : x, train_ind,
x 37 rows 2 cols (numeric)
train_ind 1 row 37 cols (numeric)
statement : ASSIGN at line 749 column 7
Now look at the red line near the bottom of the log. It says that the invalid subscript occurred in the '[' operation (subscript) on line 749, column 17 of the log. If you look at the line numbers in the log, you discover that Line 749 is
749 trainset=x[train_ind, ];
If you also print 'j' (the iteration number) in the loop, you discover that the problem occurs when j=2. The question is therefore why the X variable doesn't have enough rows to support the subscript operation for iteration 2?
The answer is that the Regress_cv module does not have any arguments and therefore all variables in the module are GLOBAL variables! (See the doc.) In particular, there is a variable x in the module that overwrites the symbol of the same name in your program.
One solution is to make the module use a local symbol table by passing in the training and test data set. For example, you might define the module as
start Regress_cv(_trainset, _testset); 
  k = ncol(_trainset);
  trainset_y = _trainset[,1];  trainset = _trainset[,2:k];
  testset_y = _testset[,1];  testset = _testset[,2:k];
  ntrain = nrow(trainset);           /* trainset size         */
  c = j(ntrain, 1, 1);             /* column vector of 1s */
  x = c||trainset;              /* concatenating       */
  y = trainset_y;
  xpxi = inv(x`*x);               /* inverse of X'X      */
  beta = xpxi * (x`*y);           /* parameter estimate  */
  ntest = nrow(testset);                /* testset size         */
  a = j(ntest, 1, 1);              /* column vector of 1s */
  testset = a||testset;
  yhat = testset*beta;            /* prediction on testset*/
  resid = testset_y-yhat;         /* residuals           */
  sse = ssq(resid);               /* SSE                 */
  dfe = nrow(x)-ncol(x);          /* error DF            */
  mse = sse/dfe;                  /* MSE                 */
  return mse;
finish Regress_cv;   Notice that I've passed in the whole train/test data. You could pass in 4 args if you prefer to subdivide the data in the main program. Using this new module, call it like this:
do i = 2 to 16;
   mse_acum=0;
   x = pcs_matrix[,2:i];
   y = pcs_matrix[,1];
   n = nrow(x);                     /* sample size         */
   ssquares= ssq(y-sum(y)/n);
   aux= 1:n;                   /* vector from 1:n     */
   call randseed(1);         /* reproducible results*/
   s = sample(aux, n, "WOR"); /* sampling without replacement */
   do j = 1 to (k-1); /* creating train and test sets and regressing */
      test_ind = s[((j-1)#ceil(n/k) + 1):(j#ceil(n/k))];
      train_ind = setdif(s,test_ind);
      trainset=pcs_matrix[train_ind, 1:i];
      testset =pcs_matrix[test_ind, 1:i];
      mse = Regress_cv(trainset, testset);
      mse_acum=mse_acum + mse;
   end;
   /* for the last fold since n it's not always divisible by k do:*/
   test_ind = s[((k-1)#ceil(n/k) + 1):n];
   train_ind = setdif(s,test_ind);
   trainset=pcs_matrix[train_ind, 1:i];
   testset =pcs_matrix[test_ind, 1:i];
   mse = Regress_cv(trainset, testset);
   mse_acum=mse_acum + mse;
   print mse_acum;
   *r2_cross[i-1] = 1 - mse_acum#n/ssquares; 
end;
For more about local and global symbols in SAS/IML, see the article "Understanding local and global variables in the SAS/IML language".
Calling @Rick_SAS
You should post it at IML forum.
The following code is I wrote before for K-Fold-CV for proc logistic.
/****** K-Fold CV ****/
%macro k_fold_cv(k=10);
ods select none;
proc surveyselect data=sashelp.heart group=&k out=have;
run;
%do i=1 %to &k ;
data training;
 set have(where=(groupid ne &i)) ;
run;
data test;
 set have(where=(groupid eq &i));
run;
ods output 
Association=native(keep=label2 nvalue2 rename=(nvalue2=native) where=(label2='c'))
ScoreFitStat=true(keep=dataset freq auc rename=(auc=true));
proc logistic data=training
 outest=est(keep=_status_ _name_) ;
 class sex;
 model status(event='Alive')=sex height weight;
 score data=test fitstat; 
run;
data score&i;
 merge true native est;
 retain id &i ;
 optimism=native-true;
run;
%end;
data k_fold_cv;
 set score1-score&k;
run;
ods select all;
%mend;
%k_fold_cv(k=10)
/*************************************/
%macro k_fold_cv_rep(r=1,k=10);
ods select none;
%do r=1 %to &r;
proc surveyselect data=sashelp.heart group=&k out=have;
run;
%do i=1 %to &k ;
data training;
 set have(where=(groupid ne &i)) ;
run;
data test;
 set have(where=(groupid eq &i));
run;
ods output 
Association=native(keep=label2 nvalue2 rename=(nvalue2=native) where=(label2='c'))
ScoreFitStat=true(keep=dataset freq auc rename=(auc=true));
proc logistic data=training
 outest=est(keep=_status_ _name_) ;
 class sex;
 model status(event='Alive')=sex height weight;
 score data=test fitstat; 
run;
data score_r&r._&i;
 merge true native est;
 retain rep &r id &i;
 optimism=native-true;
run;
%end;
%end;
data k_fold_cv_rep;
 set score_r:;
run;
ods select all;
%mend;
%k_fold_cv_rep(r=20,k=10);
/********************/
data all;
 set k_fold_cv k_fold_cv_rep indsname=indsn;
 length indsname $ 32;
 indsname=indsn;
run;
proc summary data=all nway;
 class indsname;
 var optimism;
 output out=want mean=mean lclm=lclm uclm=uclm;
run;
Hi Ksharp,
I am not familiar with macros ( I started using SAS a couple of weeks ago actually).
I'll have a look to see if I can adapt to my case here!
Thanks a lot!
Please either post a runnable program and data or post the log. that will help us determine where the error is occurring.
Hi Rick,
Please check the entire code + data attached here. The problem lies on the last part (check the last proc iml; ).
I appreciate your help!
Nice code. For someone who started using SAS a few weeks ago, you have made great progress.
Before I discuss the error, let me show you how the SAS log helps you find it. When you run the program, the log shows something like this:
747 train_ind = setdif(s,test_ind);
748 print train_ind;
749 trainset=x[train_ind, ];
750 print trainset;
751 testset =x[test_ind, ];
752 print testset;
753 testset_y = y[test_ind];
754 trainset_y = y[train_ind];
755 run Regress_cv;
756 mse_acum=mse_acum + mse;
757 end;
...
769 end;
ERROR: (execution) Invalid subscript or subscript out of range.
operation : [ at line 749 column 17
 operands : x, train_ind,
x 37 rows 2 cols (numeric)
train_ind 1 row 37 cols (numeric)
statement : ASSIGN at line 749 column 7
Now look at the red line near the bottom of the log. It says that the invalid subscript occurred in the '[' operation (subscript) on line 749, column 17 of the log. If you look at the line numbers in the log, you discover that Line 749 is
749 trainset=x[train_ind, ];
If you also print 'j' (the iteration number) in the loop, you discover that the problem occurs when j=2. The question is therefore why the X variable doesn't have enough rows to support the subscript operation for iteration 2?
The answer is that the Regress_cv module does not have any arguments and therefore all variables in the module are GLOBAL variables! (See the doc.) In particular, there is a variable x in the module that overwrites the symbol of the same name in your program.
One solution is to make the module use a local symbol table by passing in the training and test data set. For example, you might define the module as
start Regress_cv(_trainset, _testset); 
  k = ncol(_trainset);
  trainset_y = _trainset[,1];  trainset = _trainset[,2:k];
  testset_y = _testset[,1];  testset = _testset[,2:k];
  ntrain = nrow(trainset);           /* trainset size         */
  c = j(ntrain, 1, 1);             /* column vector of 1s */
  x = c||trainset;              /* concatenating       */
  y = trainset_y;
  xpxi = inv(x`*x);               /* inverse of X'X      */
  beta = xpxi * (x`*y);           /* parameter estimate  */
  ntest = nrow(testset);                /* testset size         */
  a = j(ntest, 1, 1);              /* column vector of 1s */
  testset = a||testset;
  yhat = testset*beta;            /* prediction on testset*/
  resid = testset_y-yhat;         /* residuals           */
  sse = ssq(resid);               /* SSE                 */
  dfe = nrow(x)-ncol(x);          /* error DF            */
  mse = sse/dfe;                  /* MSE                 */
  return mse;
finish Regress_cv;   Notice that I've passed in the whole train/test data. You could pass in 4 args if you prefer to subdivide the data in the main program. Using this new module, call it like this:
do i = 2 to 16;
   mse_acum=0;
   x = pcs_matrix[,2:i];
   y = pcs_matrix[,1];
   n = nrow(x);                     /* sample size         */
   ssquares= ssq(y-sum(y)/n);
   aux= 1:n;                   /* vector from 1:n     */
   call randseed(1);         /* reproducible results*/
   s = sample(aux, n, "WOR"); /* sampling without replacement */
   do j = 1 to (k-1); /* creating train and test sets and regressing */
      test_ind = s[((j-1)#ceil(n/k) + 1):(j#ceil(n/k))];
      train_ind = setdif(s,test_ind);
      trainset=pcs_matrix[train_ind, 1:i];
      testset =pcs_matrix[test_ind, 1:i];
      mse = Regress_cv(trainset, testset);
      mse_acum=mse_acum + mse;
   end;
   /* for the last fold since n it's not always divisible by k do:*/
   test_ind = s[((k-1)#ceil(n/k) + 1):n];
   train_ind = setdif(s,test_ind);
   trainset=pcs_matrix[train_ind, 1:i];
   testset =pcs_matrix[test_ind, 1:i];
   mse = Regress_cv(trainset, testset);
   mse_acum=mse_acum + mse;
   print mse_acum;
   *r2_cross[i-1] = 1 - mse_acum#n/ssquares; 
end;
For more about local and global symbols in SAS/IML, see the article "Understanding local and global variables in the SAS/IML language".
Good luck with your project.
If I may offer a small suggestion, I like to put the starting/ending indices that I loop over into arrays:
startIdx = ceil( (0:k-1)#ceil(n/k) + 1 );
endIdx = startIdx[,2:ncol(startIdx)]-1 || n;
do j = 1 to k-1;
   test_ind = s[, startIdx[j]:endIdx[j] ];
   ...
end;This will prevent you from having extra code at the end "for the last fold". You can do all the work in a single loop, which is a little cleaner.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
