in IML to remove observations from data set i can use
EDIT (data);
DELETE (observation);
PURGE;
CLOSE (data);
how to remove a VARIABLE?
do i need to transpose data so my variables become observations? is there a more efficient way?
many thanks in advance for your help!
ivs
So the end result is about N/2 pairs of variables? For a 6x6 matrix you might get
(X3, X4) [largest corr]
(X2, X6) [largest that doesn't include X3 or X4]
(X1, X5) [largest that doesn't include X3, X4, X2, or X6]
One possible approach is to not delete the rows or columns, but just set the elements of those rows and columns to missing. That way all the computations are done on the same matrix and it is easy to index into the variable names.
Because a correlation matrix is symmetric, I think you should delete two rows and two columns at each step.
I would feel bad if I took away the fun of solving this problem, so I'll just post some code that seems relevant and let you enjoy deciphering it and adapting it to your needs.
/* example corr matrix */
proc corr data=sashelp.cars noprint out=corr(where=(_type_="CORR")) ;
run;
proc iml;
use corr;
read all var _NUM_ into X[c=varNames];
close corr;
diagIdx = do(1, nrow(X)*ncol(X), ncol(x)+1);
x[diagIdx] = .;
start ind2sub( p, ind );
idx = colvec(ind);
n = nrow(idx);
col = 1 + mod(idx-1, p);
row = 1 + (idx-col) / p;
return ( row || col );
finish;
start PairCorr(x);
y = x; /* copy. We'll overwrite this with missing values */
k = floor(ncol(x)/2);
subscripts = j(k,2);
do i = 1 to k;
s = ind2sub(ncol(y), y[<:>]); *call function get (row,col) directly without looping;
y[ s, ] = .;
y[ , s] = .;
print i s y;
subscripts[i,] = s;
end;
return( subscripts );
finish;
pc = PairCorr(x);
results = varNames[ pc[,1] ] || varNames[ pc[,2] ];
print x results;
Although it is possible to use SAS/IML to perform some basic data set maintenance, or extensive maintenance taks most people use the DATA step, like so:
data A;
set sashelp.Class;
drop Age;
run;
But you don't need to physically remove a variable in order to exclude it from a PROC IML computation. Instead I use the DROP or KEEP statements to exclude it from being read when I read data into a matrix. For example, if I want all numerical variables EXCEPT the AGE variable read into a matrix, I could use any of the following statements:
use sashelp.class(drop=AGE);
read all var _NUM_ into X[c=VarNames];
close sashelp.class;
use sashelp.class(keep=Height Weight);
read all var _NUM_ into Y[c=VarNames];
close sashelp.class;
use sashelp.class;
read all var {Height Weight} into Z[c=VarNames];
close sashelp.class;
Thank you Rick! I know how to use DROP in data step or how to omit reading a variable into matrix. Unfortunately I can't see how to use these techniques in my project. I have a huge correlation matrix (actually just lower half of it), and I need to extract max item with corresponding row and column, then i need to drop this pair of variables out of the matrix and re-run the extraction. I am basically after a list of max pairwise correlations.
everything in my code is running smoothly (thanks to your input!) except for extraction of used up variables from the data. because the code is in IML i am looking for a way to drop a variable from data set in similar way as deleting an observation from a data set.
if you can think of a solution, i'd greatly appreciate it.
ivs
If your data is in a matrix then perhaps instead you mean how to delete a column from a matrix? Or if it is a correlation matrix how to delete a column and the corresponding row?
hi Tom, yes that's exactly what I am trying to do. Find max correlation in the matrix, save it and the pair of corresponding variables, then remove this pair of variables from the matrix and re-run the search for the mew max correlation. Thanks in advance if you know an efficient way to do that.
i found a way to delete rows and columns based on their intersection in the matrix:
start delrow(x,i);
return(x[setdif(1:nrow(x),i),]);
finish;
start delcol(x,i);
return(x[,setdif(1:ncol(x),i)]);
finish;
X=delcol(X,ColM); *here ColM is the col number of the intersection;
X=delrow(X,RowM); *here RowM is the row number of the intersection;
the problem with this approach is that i loose VarNames and row names.
i think i solved it through a macro variable. its not pretty, but it is working.
So the end result is about N/2 pairs of variables? For a 6x6 matrix you might get
(X3, X4) [largest corr]
(X2, X6) [largest that doesn't include X3 or X4]
(X1, X5) [largest that doesn't include X3, X4, X2, or X6]
One possible approach is to not delete the rows or columns, but just set the elements of those rows and columns to missing. That way all the computations are done on the same matrix and it is easy to index into the variable names.
Because a correlation matrix is symmetric, I think you should delete two rows and two columns at each step.
I would feel bad if I took away the fun of solving this problem, so I'll just post some code that seems relevant and let you enjoy deciphering it and adapting it to your needs.
/* example corr matrix */
proc corr data=sashelp.cars noprint out=corr(where=(_type_="CORR")) ;
run;
proc iml;
use corr;
read all var _NUM_ into X[c=varNames];
close corr;
diagIdx = do(1, nrow(X)*ncol(X), ncol(x)+1);
x[diagIdx] = .;
start ind2sub( p, ind );
idx = colvec(ind);
n = nrow(idx);
col = 1 + mod(idx-1, p);
row = 1 + (idx-col) / p;
return ( row || col );
finish;
start PairCorr(x);
y = x; /* copy. We'll overwrite this with missing values */
k = floor(ncol(x)/2);
subscripts = j(k,2);
do i = 1 to k;
s = ind2sub(ncol(y), y[<:>]); *call function get (row,col) directly without looping;
y[ s, ] = .;
y[ , s] = .;
print i s y;
subscripts[i,] = s;
end;
return( subscripts );
finish;
pc = PairCorr(x);
results = varNames[ pc[,1] ] || varNames[ pc[,2] ];
print x results;
hi Rick. This is very cool code. After trying for a few hours to add the corresponding correlation number to the output i am ready to give up. Please advise what's the best way to go about this last component. Many thanks in advance for your help!
ivs
hi Rick,
i solved it! not very pretty but it works. Here is your code with comments on how i understand it. Please let me know if its OK to concatenate M to s when reading do loop result into subscripts. Many thanks in advance!
proc iml;
use corr;
read all var _NUM_ into X[c=varNames];
close corr;
diagIdx = do(1, nrow(X)*ncol(X), ncol(x)+1); *change all diagonal 1s;
x[diagIdx] = .; *to missing;
start ind2sub( p, ind ); *declare function to determine row and column for max corr;
idx = colvec(ind);
n = nrow(idx);
col = 1 + mod(idx-1, p);
row = 1 + (idx-col) / p;
return ( row || col );
finish;
start PairCorr(x); *declare function to pick pairs of vars with max corrs;
y = x; /*copy of the corr matrix. We'll overwrite this with missing values */
k = floor(ncol(x)/2); *calculate number of pairs;
subscripts = j(k,3); *create matrix to store results of do loop;
do i = 1 to k;
M=max(y);
s = ind2sub(ncol(y), y[<:>]); *call function get (row col) directly without looping;
y[ s, ] = .; *set row with used up corr to missing;
y[ , s] = .; *set column with used up corr to missing;
subscripts[i,] = s || M; *here i am not sure if i am doing it correctly;
end;
return( subscripts );
finish;
pc = PairCorr(x);
results = varNames[ pc[,1] ] || varNames[ pc[,2] ] || char(pc[,3]); *here is also a question mark about if it is ok to convert value to character and use it;
print results;
create pairs from results;
append from results; *saved data looks good and I am happy with the result.
quit;
start ind2sub( p, ind ); *function to convert an index into a row and column subscript;
...
The other comments look fine. You probably want to keep the correlation as a numeric variable, so instead of converting pc[,3] to a character value, write it as a numeric value:
var1 = varNames[ pc[,1] ];
var2 = varNames[ pc[,2] ];
maxcorr = pc[,3];
create pairs var {var1 var2 maxcorr};
append;
close pairs;
great! thanks a million for all your help!
It was an interesting problem!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.