Solved: how to delete a VARIABLE in a data set?

ivs · Posted 04-19-2013 03:34 PM

in IML to remove observations from data set i can use

EDIT (data);

DELETE (observation);

PURGE;

CLOSE (data);

how to remove a VARIABLE?

do i need to transpose data so my variables become observations? is there a more efficient way?

many thanks in advance for your help!

ivs

Rick_SAS · Posted 04-19-2013 07:40 PM

So the end result is about N/2 pairs of variables? For a 6x6 matrix you might get

(X3, X4) [largest corr]

(X2, X6) [largest that doesn't include X3 or X4]

(X1, X5) [largest that doesn't include X3, X4, X2, or X6]

One possible approach is to not delete the rows or columns, but just set the elements of those rows and columns to missing. That way all the computations are done on the same matrix and it is easy to index into the variable names.

Because a correlation matrix is symmetric, I think you should delete two rows and two columns at each step.

I would feel bad if I took away the fun of solving this problem, so I'll just post some code that seems relevant and let you enjoy deciphering it and adapting it to your needs.

/* example corr matrix */
proc corr data=sashelp.cars noprint out=corr(where=(_type_="CORR")) ;
run;
proc iml;

use corr;
read all var _NUM_ into X[c=varNames];
close corr;
diagIdx = do(1, nrow(X)*ncol(X), ncol(x)+1);
x[diagIdx] = .;

start ind2sub( p, ind );
   idx = colvec(ind);
   n = nrow(idx);
   col = 1 + mod(idx-1, p);
   row = 1 + (idx-col) / p;
   return ( row || col );
finish;

start PairCorr(x);
   y = x; /* copy. We'll overwrite this with missing values */
   k = floor(ncol(x)/2);
   subscripts = j(k,2);
   do i = 1 to k;
      s = ind2sub(ncol(y), y[<:>]); *call function get (row,col) directly without looping;
      y[ s, ] = .;
      y[ , s] = .;
      print i s y;
      subscripts[i,] = s;
   end;
   return( subscripts );
finish;

pc = PairCorr(x);
results = varNames[ pc[,1] ] || varNames[ pc[,2] ];
print x results;

View solution in original post

Rick_SAS · Posted 04-19-2013 03:48 PM

Although it is possible to use SAS/IML to perform some basic data set maintenance, or extensive maintenance taks most people use the DATA step, like so:

data A;

set sashelp.Class;

drop Age;

run;

But you don't need to physically remove a variable in order to exclude it from a PROC IML computation. Instead I use the DROP or KEEP statements to exclude it from being read when I read data into a matrix. For example, if I want all numerical variables EXCEPT the AGE variable read into a matrix, I could use any of the following statements:

use sashelp.class(drop=AGE);
read all var _NUM_ into X[c=VarNames];
close sashelp.class;

use sashelp.class(keep=Height Weight);
read all var _NUM_ into Y[c=VarNames];
close sashelp.class;

use sashelp.class;

read all var {Height Weight} into Z[c=VarNames];

close sashelp.class;

ivs · Posted 04-19-2013 06:00 PM

Thank you Rick! I know how to use DROP in data step or how to omit reading a variable into matrix. Unfortunately I can't see how to use these techniques in my project. I have a huge correlation matrix (actually just lower half of it), and I need to extract max item with corresponding row and column, then i need to drop this pair of variables out of the matrix and re-run the extraction. I am basically after a list of max pairwise correlations.

everything in my code is running smoothly (thanks to your input!) except for extraction of used up variables from the data. because the code is in IML i am looking for a way to drop a variable from data set in similar way as deleting an observation from a data set.

if you can think of a solution, i'd greatly appreciate it.

ivs

Tom · Posted 04-19-2013 06:23 PM

If your data is in a matrix then perhaps instead you mean how to delete a column from a matrix? Or if it is a correlation matrix how to delete a column and the corresponding row?

ivs · Posted 04-19-2013 06:33 PM

hi Tom, yes that's exactly what I am trying to do. Find max correlation in the matrix, save it and the pair of corresponding variables, then remove this pair of variables from the matrix and re-run the search for the mew max correlation. Thanks in advance if you know an efficient way to do that.

i found a way to delete rows and columns based on their intersection in the matrix:

start delrow(x,i);

return(x[setdif(1:nrow(x),i),]);

finish;

start delcol(x,i);

return(x[,setdif(1:ncol(x),i)]);

finish;

X=delcol(X,ColM); *here ColM is the col number of the intersection;

X=delrow(X,RowM); *here RowM is the row number of the intersection;

the problem with this approach is that i loose VarNames and row names.

ivs · Posted 04-19-2013 07:32 PM

i think i solved it through a macro variable. its not pretty, but it is working.

Rick_SAS · Posted 04-19-2013 07:40 PM