Statistical programming, matrix languages, and more

~1mln regressions

Reply
Occasional Contributor
Posts: 10

~1mln regressions

Hi,

I need (and want) to solve ~1 mln regressions in SAS IML (in loop, of course) and it takes ~40 minutes.

Is there possible way to speed it up? Or the SAS IML language is not the best solution for this kind of analysis?

I know that in fortran it takes less then 20 sec.

I would be grateful for assistance in this matter.

Respected Advisor
Posts: 3,782

~1mln regressions

Why are you doing regression with IML?  Perhaps it is not the regression that is slow but some other housekeeping action that is not optimized.

You don't give much info, you don't show any code, or data so I can only guess.

Guessing I would say use PROC REG.  It should be much faster.

Occasional Contributor
Posts: 10

Re: ~1mln regressions

Hi,

I would like to implement Shapley Value Regression in SAS IML and i'm testing if it is a right way by looping regression for 20 predictors  (2^20 regressions).

Code is simple -> looping from i to 2^20 with 1500 obs.

Code (data imported from dataset):

n=1000000;

b=j(n,1,0);

do i=1 to n;

b[i,1]=inv(x`*x)*x`*y;

end;

Respected Advisor
Posts: 3,782

Re: ~1mln regressions

BERENZ wrote:

Hi,

I would like to implement Shapley Value Regression in SAS IML and i'm testing if it is a right way by looping regression for 20 predictors  (2^20 regressions).

Code is simple -> looping from i to 2^20 with 1500 obs.

Code (data imported from dataset):

n=1000000;

b=j(n,1,0);

do i=1 to n;

b[i,1]=inv(x`*x)*x`*y;

end;

More questions one possible answer.

Doesn't the expression in the do loop yield the same result every time?  Is this really your program?

This should optimize inv(x`x) x`y

      b[i,1]=ginv(x)*y;

Occasional Contributor
Posts: 10

Re: ~1mln regressions

Hi,

program form aboce is only for tests to check how much time it takes SAS to loop over such number of regressions. The whole program will be:

a) generate matrix for regressions, with this code:

proc iml;

start makro(n);

          rows=2##n;

          x=j(rows,n);

          do j=1 to n;

                    PatLength=2##(n-j);

                    PatRepl=2##(j-1);

                    pattern=j(PatLength,1,0)//j(PatLength,1,1);

                    x[,j]=repeat(pattern,PatRepl);

          end;

          return(x);

finish;

pred=20; /* number of predictors */

x=makro(pred);

quit;

b) then do 2^n regressions and for each calculate R^2 with and without each predictor.

c) for each predictor calculate SV with equation in this document: http://marketing.gfkamerica.com/website/articles/ShapelyValueRegression.pdf

But, for now I don't have it all in SAS code. I

SAS Super FREQ
Posts: 3,615

Re: ~1mln regressions

One suggestion: solve the normal equations more efficiently by using the SOLVE function instead of using the INV function to form the general inverse. See the Programming Tip on p. 51 of Statistical Programming with SAS/IML Software. You can download Chapter 2 of the book for free at the books Web site.

The module that you post is from a previous Discussion Forum post and my blog (although you've deleted my helpful comments and did not cite the source). For details, interested readers can see http://blogs.sas.com/iml/index.php?/archives/66-Creating-a-Matrix-with-All-Combinations-of-Zeros-and... .

The rows of this matrix (call it m) enumerate the 2##p subsets of the p variables in the model. The first row is the intercept-only model. For the other rows, set Xi = X[,loc(m[i,])] and solve the i_th regression problem with the data matrix Xi.

This is the brute force method; there are more efficient ways (such as not storing all those zeros/ones, but rather computing them in the same loop that does the regression). Notice, that 2#20 is more than 1 million, so even if you spend only 0.01 seconds on each regression, we're talking about 10,000 seconds = 175 minutes = 3 hours. If you are getting preliminary results in 40 minutes, I'm impressed. (I guess it depends on the size of your data.)

Occasional Contributor
Posts: 10

Re: ~1mln regressions

Hi,

I appologise for not posting the source, of course I used your code for generate this matrix. And your blog was very helpful in many cases Smiley Wink

Thanks for your sugestions. I will have some time for implement this algorithm and will post my efficiency results.

I wonder if it is posible in sas iml to have such results as in frotran or as I asked in first post, iml or 4gl language will not allow to have similar results.

N/A
Posts: 1

Re: ~1mln regressions

Hi guys,

I am trying to set up a matrix of 29 columns and n rows from temp data set  by using the proc IML.  My next step is that I need to transpose the matrix Please help me  to understand what I am doing wrong

use work.matr_2

read all var (ID Type Code Day Var 1.....Var 24) into rm;

read all var (Day) into rm;

n=NROW (rm);

m=NCOL (rm);

d= J(n,1,m);

c=d;

d=rm[ . ,29]/rm[. ,1];

c=rm[., 29]-rm[. , 2];

x = rm || d||c;

append from x;

quit;

SAS Super FREQ
Posts: 3,615

Re: ~1mln regressions

You are overwriting the rm matrix (from the first READ stmt) with the second READ stmt.

If you are going to read DAY, just use READ ALL VAR {DAY} and it creates a vector named DAY. (But DAY is already the 4th column of RM, so....)

You can use the T() function to make a transpose.

I also note that you are trying to use the APPEND stmt, but you haven't used the CREATE stmt to create a data set.

I suggest you download the FREE chapter of my book Statistical Programming with SAS/IML Programming.

To learn how to read and write data sets, go to my blog http://blogs.sas.com/content/iml and click on the tags in the right-hand column called "Getting Started" and "Reading and Writing Data." Of particular interest:

Transposing matrices: http://blogs.sas.com/content/iml/2011/01/17/on-the-flip-side-exchanging-rows-and-columns/

Reading data: http://blogs.sas.com/content/iml/2010/10/04/reading-sas-data-sets/

Writing data: http://blogs.sas.com/content/iml/2011/04/18/writing-data-from-a-matrix-to-a-sas-data-set/

Occasional Contributor
Posts: 10

Re: ~1mln regressions

Hi,

here is my code for Shapley Value and sample data http://www.megaupload.com/?d=7WN5AS2W

and info about Shapley Value: http://marketing.gfkamerica.com/website/articles/ShapelyValueRegression.pdf

For 20 regressors it took ~4 h and for more regressors I can't perform analysis...

Is it possible to speed up my analysis?

/*declaring variables*/

%let x1=V6015 V6016 V6017 V6018 V6019 V6020;

proc iml;

use sample_data;

read all var {y} into y;

read all var {&x1} into x;

/*generating 0-1 matrix*/

/*source of makro - Rick Wicklin, SAS Institute*/

/*http://blogs.sas.com/content/iml/2011/01/05/creating-a-matrix-with-all-combinations-of-zeros-and-one...

start makro(n);

          rows=2##n;

          x=j(rows,n);

          do j=1 to n;

                    PatLength=2##(n-j);

                    PatRepl=2##(j-1);

                    pattern=j(PatLength,1,0)//j(PatLength,1,1);

                    x[,j]=repeat(pattern,PatRepl);

          end;

          return(x);

finish;

zero=makro(ncol(x));

/*constant for 0-1 matrix and x*/

const1=j(nrow(zero),1,1);

const2=j(nrow(x),1,1);

/*attach constants*/

x=x||const2;

zero=zero||const1;

/*solving linear equation for all regressors*/

A=x`*x;

c=x`*y;

b=solve(A,c);

y_hat=x*b;

OSK=t(y)*y-1/nrow(x)*sum(y);

RSK=t(y_hat)*y_hat-1/nrow(x)*sum(y_hat);

R2_all=RSK/OSK;

n=ncol(x);

/*matrix for Shapley Value*/

sv=j(n,1,0);

/*loop thru variables*/

do j=1 to ncol(x)-1;

/*matrix with selected variable*/

zero_with=zero[loc(zero[,j]),];

/*matrix without selected variable*/

zero_without=zero[loc(zero[,j]=0),];

/*loop thru regressions*/

          do i=1 to nrow(zero_with);

/*                    gamma from equation*/

                    gamma=fact(sum(zero[i,]))#fact(ncol(x)-sum(zero[i,])-1)/fact(ncol(x));

/*                    matrix for x with and without selected variable*/

                    x_with=x#zero_with[i,];

                    x_without=x#zero_without[i,];

 

 

/*                    matrix x with selected variables*/

                    x_with=x_with[,loc(x_with[+,])];

                    x_without=x_without[,loc(x_without[+,])];

 

/*                    solving linear equations*/

                    A_with=x_with`*x_with;

                    A_without=x_without`*x_without;

                    c_with=x_with`*y;

                    c_without=x_without`*y;

                    b_with=solve(A_with,c_with);

                    b_without=solve(A_without,c_without);

 

                    y_hat_with=x_with*b_with;

                    y_hat_without=x_without*b_without;

                    OSK_with=t(y)*y-1/nrow(x_with)*sum(y);

                    OSK_without=t(y)*y-1/nrow(x_without)*sum(y);

                    RSK_with=t(y_hat_with)*y_hat_with-1/nrow(x_with)*sum(y_hat);

                    RSK_without=t(y_hat_without)*y_hat_without-1/nrow(x_without)*sum(y_hat_without);

                    R2_with=RSK_with/OSK_with;

                    R2_without=RSK_without/OSK_without;

 

/*                    shapley value for each variable*/

                    sv[j,]=sv[j,]+gamma#(R2_with-R2_without);

          end;

end;

/*net effect for all variables*/

NE=j(ncol(x),1,0);

C=0;

n1=ncol(x)-1;

x1=x[,1:n1];

corr=corr(x1,"spearman");

do i=1 to ncol(x1);

          do j=1 to ncol(x1);

                    if i=j then C=C;

                    else C=C+b[i,1]*corr[i,j]*b[j,1];

          end;

          NE[i,1]=b[i,1]*2+C;

end;

a1=sv[+,];

a2=NE[+,];

sv1=sv/a1#100;

NE1=NE/a2#100;

rn={&x1 const};

print R2_all;

print b[Rowname=rn] sv sv1 NE NE1;

quit;

SAS Super FREQ
Posts: 3,615

Re: ~1mln regressions

I read the paper this weekend, but I only glanced at your code. Here are some comments and suggestions:

1) This is a VERY expensive method, as it requires the R-square value for the evaluation of all 2^p models in the p variables.  The reason certain techniques such as all-subset regression succeed is that they can prune the set of possible models so that they don't actually evaluate all possible models. For this method, the paper doesn't mention a shortcut. This means that this kind of analysis won't be practical for models that examine many tens of variables.

2) I agree with data_null that PROC REG is the way to generate all the R-square values. Use the SELECTION=RSQUARE option on the model statement is use ODS OUTPUT SubsetSelSummary=RSquareValues;  That should produce the R-square values fairly quickly. Here's some code:

ods graphics off;

proc reg data=sashelp.cars;

ods select SubsetSelSummary;

ods output SubsetSelSummary=RsquareValues;

model mpg_highway = mpg_city cylinders enginesize horsepower/

   selection=rsquare;

run;

3) Unfortunately, the models are given by importance, rather than in order of the variables in the data set. I think the next step is to sort the values by NumInModel and VarsInModel. Then (and this might be tough) read the sorted data set into IML and use it to figure out the Shapley Values. (It seems like this step will be simplest if the variables are renamed x1-x20.)

4) I did find a complicated macro that seems to implement this approach, but I don't understand the macro well enough to know whether to recommend it. http://spalsasandexcel.blogspot.com/2009/07/shapley-value-macrosas.html  If it works, great. If not, maybe a careful analysis of the macro will give you some more hints,

Occasional Contributor M_K
Occasional Contributor
Posts: 14

Re: ~1mln regressions

hi Rick,

thats really great information. do you know any code to read the sorted data set into IML to figure out Shapley Value?

Thanks!

SAS Super FREQ
Posts: 3,615

Re: ~1mln regressions

The USE and READ statements read data from SAS data sets: http://blogs.sas.com/content/iml/2010/10/04/reading-sas-data-sets/

N/A
Posts: 1

Re: ~1mln regressions

Here is my code to compute shapley value using Proc Reg selection=Rsquare and Proc IML procedure as suggested by Rick Wicklin

%let dsn=;  /*data set name*/

%let indepvar=  ; /*list of independent variables separated with space*/

%let depvar=; /*dependent variable*/

%macro Shapley_value_regression;

/*Count No of Independent Cariable*/

%local nvar list;

%let nvar=1;

%let list=%qscan(&indepvar,&nvar,%str( ));

%do %while(&list ne);

%let nvar=%eval(&nvar+1);

%let list=%qscan(&indepvar,&nvar,%str( ));

%end;

%let nvar=%eval(&nvar-1);

data _temp_;

  set &dsn;

  keep &indepvar &depvar;

run;

/*Rename Variables*/

data _temp_;

set  _temp_;

  rename

  %do i=1 %to &nvar;

  %let var&i=%sysfunc(scan(&indepvar,&i,' '));

  &&var&i =x&i %end ; ;

run;

/*Reg with selection=rsquare*/

proc reg data=_temp_;

  ods select SubsetSelSummary;

  ods output SubsetSelSummary=RsquareValues;

  model &depvar=x1-x&nvar. /selection=rsquare;

run;

data RsquareValues;

set RsquareValues;

  drop Dependent Model control Modelindex;

  data Null_Model;

  NumInModel =0;

  Rsquare=0;

  VarsInModel='x0';

proc append base=RsquareValues data=null_model;

proc sort data=RsquareValues;

  by NumInModel VarsInModel;

run;

/*Shapley Value computation*/

%do i=1 %to &nvar;

  data R2_with_&i.( rename=(RSquare=Rsquare_with VarsInModel=x_with))

  R2_without_&i. (drop =NumInModel rename=(RSquare=Rsquare_without VarsInModel=x_without));

  set RsquareValues;

  if find(VarsInModel,"x&i") gt 0 then output R2_with_&i.;

  else output R2_without_&i.;

  run;

  data R2_delta_&i.;

  set  R2_with_&i.;

  set  R2_without_&i. ;

  run;

  data R2_delta_&i. (keep=x&i. NumInModel );

  set  R2_delta_&i.;

  x&i.=Rsquare_with-Rsquare_without;

  run;

%end;

data R2_delta;

merge %do i=1 %to &nvar; R2_delta_&i. %end; ;

  rename

  %do i=1 %to &nvar;

  %let var&i=%sysfunc(scan(&indepvar,&i,' '));

  x&i=&&var&i  %end ; ;

run;

proc iml;

  use R2_delta;

  read all var {&indepvar} into delta;

  read all var {NumInModel} into NIM;

  p=ncol(delta);

  q=nrow(delta);

  sv=j(q,p,0);

   do j=1 to p;

      do i=1 to q;

   do m=1 to NIM[i,];

    gamma=0;

    w=fact(NIM[i,]-1)#(fact(p-(NIM[i,]-1)-1));

    w=w/fact(p);

    gamma=gamma+w;

    sv[i,j]=gamma#delta[i,j];

    end;

       end;

   end;

  sv=SV[+,]`;

  RSquare=sv[+,];

  Rname={&indepvar};

  Cname={SV  Var };

  print SV [rowname=Rname colname=Cname] ;

  print RSquare ;

  create shapley_value var {Rname sv };

  append;

  close shapley_value;

quit;

%MEND Shapley_value_regression;

%Shapley_value_regression;

proc print data=shapley_value;run;

for details check this: http://predictive-analytics-world.blogspot.in/2013/06/shapley-value-regression-sas-macro.html

N/A
Posts: 1

Re: ~1mln regressions

Hi,

Can I have SAS code for Shapley Value Regression for PROC LOGISTIC??? I need to run Shapley Value regression using pseudo r square of logistic regression.

Ask a Question
Discussion stats
  • 14 replies
  • 1627 views
  • 0 likes
  • 7 in conversation