## Solving OLS regression after cleaning empty columns

Dear all,

I wrote a simulation that about ten percent of the time end with a matrix data that contains one  'zero' column.

I identify those cases with   if abs(det(xpx))=0.

I clean the zero column and I try to calculate the OLS coefficients.

But unfortunately I get a matrix without the OLS coefficients.

The program works but I don't get the 10 percent  results matrix with data, even though I record the data files and there should be coefficients for the OLS. (data with the zero columns).

I don't understand why. I enclose the program. You can 'run' it and see the results.

1. Can someone understand why I get matrix without data for the 10 percent cases of the simulation?

2. Any suggestions how to correct that?

3. Maybe there is more efficient way to write this program that at the end I can get one matrix with also the 10 percent result, Ideas are welcome..

Orit

proc iml;

n=100;

b=j(6,n);

bpsl=j(5,n);

bnsl=j(5,n);

bptl=j(5,n);

bntl=j(5,n);

do j=1 to n;

xtl=j(240, 1);

xsl=j(240, 1);

ptl=j(240, 1);

psl=j(240, 1);

ntl=j(240, 1);

nsl=j(240, 1);

y=j(240, 1);

xtl[1,1]=-1.73754E-16;*/SAS avg obs*/;

xsl[1,1] =-1.12557E-15;

ptl[1,1]=0.0051111;

psl[1,1] =0.2604981;

ntl[1,1]=-0.0051111;

nsl[1,1] =-0.2604981;

y[1,1] =0.9783607;

a=0.18935 ;*first regression*;

k=0.83843;

ka=0.17226;

kb=-0.39250;

kc=-0.21518;

kd=5.16021;

c=-0.03537; *second regression*;

e=0.02979;

d=0.94756;

h=-2.04415;

m=-0.00034939; *third regression*;

l=0.00023025;

q=-0.00045077;

r=0.98906;

/** specify the mean and covariance of the population **/

Mean = {0, 0, 0};

Cov = {0.0467930604 0.0024632274 0.0000520215,

0.0024632274 0.0080371426 0.0000384574,

0.0000520215 0.0000384574 0.0000022165};

NumSamples = 240;

call randseed(0);  /** set seed for the RandNormal module **/

U = RandNormal(NumSamples, Mean, Cov);

do i=2 to 240;

Y[i,1]=a+k*y[(i-1),1]+ka*psl[(i-1),1]+kb*ptl[(i-1),1]+kc*nsl[(i-1),1]+kd*ntl[(i-1),1]+u[i,1];

xsl[i,1]=c+e*y[(i-1),1]+d*xsl[(i-1),1]+h*xtl[(i-1),1]+u[i,2];

xtl[i,1]=m+l*y[(i-1),1]+q*xsl[(i-1),1]+r*xtl[(i-1),1]+u[i,3];

if xsl[i,1]>=0 then do;

psl[i,1]=xsl[i,1];

nsl[i,1]=0;

end;

if xsl[i,1]<0 then do;

psl[i,1]=0;

nsl[i,1]=xsl[i,1];

end;

if xtl[i,1]>=0 then do;

ptl[i,1]=xtl[i,1];

ntl[i,1]=0;

end;

if xtl[i,1]<0 then do;

ptl[i,1]=0;

ntl[i,1]=xtl[i,1];

end;

end;

ylag=lag(y);

/* Step 1: Compute X`X and X`y */

x = j(nrow(y), 1, 1) || ylag || psl|| ptl ||nsl||ntl;            /* add intercept column */

v = {1};                   /* specify rows to exclude             */

idx = setdif(1:nrow(x), v);  /* start with 1:240, exclude values in v */

xnew = x[idx, ];             /* extract submatrix                   */

xpx = xnew`* xnew;                         /* cross-products       */

v = {1};                   /* specify rows to exclude             */

idy = setdif(1:nrow(y), v);  /* start with 1:5, exclude values in v */

ynew = y[idx, ];             /* extract submatrix*/

xpy = xnew` * ynew;

if abs(det(xpx))=0 then do;

z1 = ynew || xnew;

create SINGULAR from z1[c={Y Ylag psl ptl nsl ntl}];

append from z1;

close SINGULAR;

end;

if abs(det(xpx))=0 then do;

if ssq(psl)=0 then do;

z = j(nrow(y), 1, 1) || ylag || ptl ||nsl||ntl;            /* add intercept column */

v = {1};                   /* specify rows to exclude             */

idx = setdif(1:nrow(z), v);  /* start with 1:240, exclude values in v */

znew = z[idx, ];             /* extract submatrix                   */

zpz = znew`* znew;                         /* cross-products       */

v = {1};                   /* specify rows to exclude             */

idy = setdif(1:nrow(y), v);  /* start with 1:5, exclude values in v */

ynew = y[idx, ];             /* extract submatrix*/

zpy = znew` * ynew;

bpsl[,j]= solve(zpz, xpy);

end;

if ssq(ptl)=0 then do;

z = j(nrow(y), 1, 1) || ylag || psl||nsl||ntl;             /* add intercept column */

v = {1};                   /* specify rows to exclude             */

idx = setdif(1:nrow(z), v);  /* start with 1:240, exclude values in v */

znew = z[idx, ];             /* extract submatrix                   */

zpz = znew`* znew;                         /* cross-products       */

v = {1};                   /* specify rows to exclude             */

idy = setdif(1:nrow(y), v);  /* start with 1:5, exclude values in v */

ynew = y[idx, ];             /* extract submatrix*/

zpy = znew` * ynew;

bptl[,j]= solve(zpz, xpy);

end;

if ssq(nsl)=0 then do;

z = j(nrow(y), 1, 1) || ylag || psl|| ptl ||ntl;            /* add intercept column */

v = {1};                   /* specify rows to exclude             */

idx = setdif(1:nrow(z), v);  /* start with 1:240, exclude values in v */

znew = x[idx, ];             /* extract submatrix                   */

zpz = znew`* xnew;                         /* cross-products       */

v = {1};                   /* specify rows to exclude             */

idy = setdif(1:nrow(y), v);  /* start with 1:5, exclude values in v */

ynew = y[idx, ];             /* extract submatrix*/

zpy = znew` * ynew;

bnsl[,j]= solve(zpz, xpy);

end;

if  ssq(ntl)=0 then do;

z = j(nrow(y), 1, 1) || ylag || psl|| ptl ||nsl;           /* add intercept column */

v = {1};                   /* specify rows to exclude             */

idx = setdif(1:nrow(x), v);  /* start with 1:240, exclude values in v */

znew = x[idx, ];             /* extract submatrix                   */

zpz = znew`* znew;                         /* cross-products       */

v = {1};                   /* specify rows to exclude             */

idy = setdif(1:nrow(y), v);  /* start with 1:5, exclude values in v */

ynew = y[idx, ];             /* extract submatrix*/

zpy = znew` * ynew;

bntl[,j]= solve(zpz, xpy);

end;

end;

IF(abs(DET(xpx)))>0 then do;

*xpxi = inv(xpx); /* form inverse crossproducts */

*b[,j] = xpxi * xpy; /* solve for parameter estimates */

*b= solve(xpx, xpy);

*print b;

b[,j]= solve(xpx, xpy);

end;

end;

bt=b`;

varNames = "b0":"b5";

experiment="Sim1":"Sim100";

print bt[colname=varNames

rowname=experiment

label="Expanded Model"];

bpslt=bpsl`;

varNames = "b0":"b4";

experiment="Sim1":"Sim100";

print bpslt[colname=varNames

rowname=experiment

label="Expanded Model"];

bnslt=bnsl`;

varNames = "b0":"b4";

experiment="Sim1":"Sim100";

print bnslt[colname=varNames

rowname=experiment

label="Expanded Model"];

bptlt=bptl`;

varNames = "b0":"b4";

experiment="Sim1":"Sim100";

print bptlt[colname=varNames

rowname=experiment

label="Expanded Model"];

bntlt=bntl`;

varNames = "b0":"b4";

experiment="Sim1":"Sim100";

print bntlt[colname=varNames

rowname=experiment

label="Expanded Model"];

quit;

proc print data=singular;

run;

1 ACCEPTED SOLUTION

Accepted Solutions

## Re: Solving OLS regression after cleaning empty columns

Here's an example of how I'd solve the problem without using any determinants or IF-THEN statements.

proc iml;
x = randnormal(100, j(1,6,0), I(6));
y = randnormal(100, 0, 1);
x[,3]=0; x[,6]=0;  /* set some columns to exact 0 */

xpx = x`*x;  /* normal equations */
xpy = x`*y;

/* Option 1: Use generalized inverse to solve (possibly singular) normal equations */
beta = ginv(xpx)* xpy;
print beta;

/* Option 2: delete columns of xpx that are exactly zero.
Set coeffs of deleted columns to missing */
beta = j(nrow(xpx), 1, .);  /* or you can initialize beta to zero */
dropVars = loc( xpx[##,]=0 );
keepVars = setdif(1:ncol(xpx), dropVars);
beta[keepVars] = solve(xpx[keepVars,keepVars], xpy[keepVars]);
print beta;

12 REPLIES 12

## Re: Solving OLS regression after cleaning empty columns

Try this: Make the program reproducible by using a positive seed to the  RANDSEED call, such as seed=1. You then can say something like

"During iteration j=3 [or whatever], my program computes [specify matrices and values that you don't understand].  I expect to see [what you think should happen].  Can anyone explain why I don't get what I expect?"

## Re: Solving OLS regression after cleaning empty columns

Thank you Rick for your tip. I'll try to be more clear.

I used Rick tip- I used in the RANDSEED call seed=1,

and during iteration 96 I got column of zero's in the variable 'ptl'.

I have to exclude that variable because the matrix is not invertible.

I catch those situations by the if statement   if abs(det(xpx))=0 then do;

And then I have to recognize that variable ( ssq(ptl)=0) and exclude him by:

if ssq(ptl)=0 then do;

z = j(nrow(y), 1, 1) || ylag || psl || nsl || ntl;             /*Without ptl variable */

v = {1};                   /* specify rows to exclude             */

idx = setdif(1:nrow(z), v);  /* start with 1:240, exclude values in v */

znew = z[idx, ];             /* extract submatrix                   */

zpz = znew`* znew;                         /* cross-products       */

v = {1};                   /* specify rows to exclude             */

idy = setdif(1:nrow(y), v);  /* start with 1:5, exclude values in v */

ynew = y[idx, ];             /* extract submatrix*/

zpy = znew` * ynew;

bptl[,j]= solve(zpz, xpy);

end;

But I get the matrix bptl  (bptlt=bptl`;) of coefficients OLS results empty in simulation 96.

I'll really appreciate any help,

Thanks,

Orit

## Re: Solving OLS regression after cleaning empty columns

I think you cut and pasted and forgot to change variable names.  Shouldn't the assignment statement be

bptl[,j]= solve(zpz, zpy);  /* ZPY, not XPY */

If this is the mistake, you've made it other places as well.

I'll also mention that you don't need to compute IDY.  IDX and IDY are the same.  This is fortunate because your program computes

ynew = y[idx, ];       /* should be IDY in subscript */

but it doesn't matter because IDX=IDY.

Your code assumes that you never have multiple zero columns. That is ssq(psl)=0 and ssq(ptl)=0 are assumed never to happen simultaneously.   If you can have multiple columns be zero simultaneously, you might consider treating all cases at once:

1) Form x matrix

2) compute ssq = x[##, ]; /* ssq for each column */

3) badCols = loc(ssq=0);  /* check for all zeros */

4) remove the bad columns, then compute parameter estimates for the good columns, etc

## Re: Solving OLS regression after cleaning empty columns

Dear Rick,

But, unfortunately SAS doesn't recognize this part of the program,( in the 'log' it seems that everything is o.k)

I tried to print the Z, ZPZ, ZPY matrices but I get "ERROR: Matrix z has not been set to a value"

even though in iteration 96 it should be with a value.

Why it doesn't create the Z matrix?  ( in the part that starts with:  if abs(det(xpx))=0 then do;)

Thank you!

Orit

enclose the corrected program:

proc iml;

n=100;

b=j(6,n);

bpsl=j(5,n);

bnsl=j(5,n);

bptl=j(5,n);

bntl=j(5,n);

do j=1 to n;

xtl=j(240, 1);

xsl=j(240, 1);

ptl=j(240, 1);

psl=j(240, 1);

ntl=j(240, 1);

nsl=j(240, 1);

y=j(240, 1);

xtl[1,1]=-1.73754E-16;*/SAS avg obs*/;

xsl[1,1] =-1.12557E-15;

ptl[1,1]=0.0051111;

psl[1,1] =0.2604981;

ntl[1,1]=-0.0051111;

nsl[1,1] =-0.2604981;

y[1,1] =0.9783607;

a=0.18935 ;*first regression*;

k=0.83843;

ka=0.17226;

kb=-0.39250;

kc=-0.21518;

kd=5.16021;

c=-0.03537; *second regression*;

e=0.02979;

d=0.94756;

h=-2.04415;

m=-0.00034939; *third regression*;

l=0.00023025;

q=-0.00045077;

r=0.98906;

/** specify the mean and covariance of the population **/

Mean = {0, 0, 0};

Cov = {0.0467930604 0.0024632274 0.0000520215,

0.0024632274 0.0080371426 0.0000384574,

0.0000520215 0.0000384574 0.0000022165};

NumSamples = 240;

call randseed(1);  /** set seed for the RandNormal module **/

U = RandNormal(NumSamples, Mean, Cov);

do i=2 to 240;

Y[i,1]=a+k*y[(i-1),1]+ka*psl[(i-1),1]+kb*ptl[(i-1),1]+kc*nsl[(i-1),1]+kd*ntl[(i-1),1]+u[i,1];

xsl[i,1]=c+e*y[(i-1),1]+d*xsl[(i-1),1]+h*xtl[(i-1),1]+u[i,2];

xtl[i,1]=m+l*y[(i-1),1]+q*xsl[(i-1),1]+r*xtl[(i-1),1]+u[i,3];

if xsl[i,1]>=0 then do;

psl[i,1]=xsl[i,1];

nsl[i,1]=0;

end;

if xsl[i,1]<0 then do;

psl[i,1]=0;

nsl[i,1]=xsl[i,1];

end;

if xtl[i,1]>=0 then do;

ptl[i,1]=xtl[i,1];

ntl[i,1]=0;

end;

if xtl[i,1]<0 then do;

ptl[i,1]=0;

ntl[i,1]=xtl[i,1];

end;

end;

ylag=lag(y);

/* Step 1: Compute X`X and X`y */

x = j(nrow(y), 1, 1) || ylag || psl|| ptl ||nsl||ntl;            /* add intercept column */

v = {1};                   /* specify rows to exclude             */

idx = setdif(1:nrow(x), v);  /* start with 1:240, exclude values in v */

xnew = x[idx, ];             /* extract submatrix                   */

xpx = xnew`* xnew;                         /* cross-products       */

ynew = y[idx, ];             /* extract submatrix*/

xpy = xnew` * ynew;

if abs(det(xpx))=0 then do;

z1 = ynew || xnew;

create SINGULAR from z1[c={Y Ylag psl ptl nsl ntl}];

append from z1;

close SINGULAR;

end;

if abs(det(xpx))=0 then do;

if ssq(psl)=0 then do;

z = j(nrow(y), 1, 1) || ylag || ptl ||nsl||ntl;            /* add intercept column */

v = {1};                   /* specify rows to exclude             */

idx = setdif(1:nrow(z), v);  /* start with 1:240, exclude values in v */

znew = z[idx, ];             /* extract submatrix                   */

zpz = znew`* znew;                         /* cross-products       */

ynew = y[idx, ];             /* extract submatrix*/

zpy = znew` * ynew;

bpsl[,j]= solve(zpz, zpy);

end;

if ssq(ptl)=0 then do;

z = j(nrow(y), 1, 1) || ylag || psl||nsl||ntl;             /* add intercept column */

v = {1};                   /* specify rows to exclude             */

idx = setdif(1:nrow(z), v);  /* start with 1:240, exclude values in v */

znew = z[idx, ];             /* extract submatrix                   */

zpz = znew`* znew;                         /* cross-products       */

ynew = y[idx, ];             /* extract submatrix*/

zpy = znew` * ynew;

bptl[,j]= solve(zpz, zpy);

end;

if ssq(nsl)=0 then do;

z = j(nrow(y), 1, 1) || ylag || psl|| ptl ||ntl;            /* add intercept column */

v = {1};                   /* specify rows to exclude             */

idx = setdif(1:nrow(z), v);  /* start with 1:240, exclude values in v */

znew = z[idx, ];             /* extract submatrix                   */

zpz = znew`* xnew;                         /* cross-products       */

ynew = y[idx, ];             /* extract submatrix*/

zpy = znew` * ynew;

bnsl[,j]= solve(zpz, zpy);

end;

if  ssq(ntl)=0 then do;

z = j(nrow(y), 1, 1) || ylag || psl|| ptl ||nsl;           /* add intercept column */

v = {1};                   /* specify rows to exclude             */

idx = setdif(1:nrow(x), v);  /* start with 1:240, exclude values in v */

znew = z[idx, ];             /* extract submatrix                   */

zpz = znew`* znew;                         /* cross-products       */

ynew = y[idx, ];             /* extract submatrix*/

zpy = znew` * ynew;

bntl[,j]= solve(zpz, zpy);

end;

end;

IF(abs(DET(xpx)))>0 then do;

*xpxi = inv(xpx); /* form inverse crossproducts */

*b[,j] = xpxi * xpy; /* solve for parameter estimates */

*b= solve(xpx, xpy);

*print b;

b[,j]= solve(xpx, xpy);

end;

end;

bt=b`;

varNames = "b0":"b5";

experiment="Sim1":"Sim10000";

print bt[colname=varNames

rowname=experiment

label="Expanded Model"];

bpslt=bpsl`;

varNames = "b0":"b4";

experiment="Sim1":"Sim10000";

print bpslt[colname=varNames

rowname=experiment

label="Expanded Model"];

bnslt=bnsl`;

varNames = "b0":"b4";

experiment="Sim1":"Sim10000";

print bnslt[colname=varNames

rowname=experiment

label="Expanded Model"];

bptlt=bptl`;

varNames = "b0":"b4";

experiment="Sim1":"Sim10000";

print bptlt[colname=varNames

rowname=experiment

label="Expanded Model"];

bntlt=bntl`;

varNames = "b0":"b4";

experiment="Sim1":"Sim10000";

print bntlt[colname=varNames

rowname=experiment

label="Expanded Model"];

quit;

proc print data=singular;

run;

## Re: Solving OLS regression after cleaning empty columns

When you compute ssq(psl), ssq(ptl), etc, you are including the first observation of x in the computation. That observation was deleted when you formed xnew.

I think you want to examine whether any values of xnew[##,] are zero, or alternately ssq(psl[idx]).

## Re: Solving OLS regression after cleaning empty columns

Dear Rick,

I also used your advice and consider to treat all 'special' cases at once:

if abs(det(xpx))=0 then do;

ssq = xnew[##, ];

znew=xnew[ ,loc(ssq>0)];

zpz = znew`* znew;

zpy = znew` * ynew;

bpsl[,j]= solve(zpz, zpy);

end;

Even though it's really more short and efficient, the problem that I get coefficients but I don't know for which variables.

In the cases where if abs(det(xpx))=0, each time the 'zero' columns come from different variable.

For statistical checks it's really important that I know for each variable his coefficient.

Do you think there is a way in this efficient code that I can know for each coefficient his variable?

Thanks,

Orit

## Re: Solving OLS regression after cleaning empty columns

Yes, there is some bookkeeping that I left out that you would need to figure out.  Given that you are having lots of problems getting the code to run, I suggest that you ignore my suggestion and save the generalizations for another day.

## Re: Solving OLS regression after cleaning empty columns

Dear Rick,

Thank you for your great tips!

Given I have both of your books, maybe there are pages that you recommend me to read about the generalizations  ..?

Have a great day,

Orit

## Re: Solving OLS regression after cleaning empty columns

When you exclude a column, what value do you assign to the coefficient? Zero? One? Missing?

If a variable has the constant value 0, the corresponding regression coefficient can have any arbitrary value. I guess setting the coefficient to 'missing' makes the most sense?

## Re: Solving OLS regression after cleaning empty columns

Dear Rick,

Yes you right. When the variable is a column of zero the coefficient should be missing.

The matrix of the coefficients should be something like:

bo   b1   b2   b3   b4   b5     b6

sim1    0.5  0.2  0.3  .      0.8  0.9  0.05

sim2   0.5  0.2  0.3  0.8    0.8  .      0.05

sim3    0.5  0.2  0.3  .      0.8  0.9     .

sim4    0.5  0.2  0.3 0.7   0.8  0.9    0.33

etc..

I would really appreciate your opinion.

Thanks,

Orit

## Re: Solving OLS regression after cleaning empty columns

Here's an example of how I'd solve the problem without using any determinants or IF-THEN statements.

proc iml;
x = randnormal(100, j(1,6,0), I(6));
y = randnormal(100, 0, 1);
x[,3]=0; x[,6]=0;  /* set some columns to exact 0 */

xpx = x`*x;  /* normal equations */
xpy = x`*y;

/* Option 1: Use generalized inverse to solve (possibly singular) normal equations */
beta = ginv(xpx)* xpy;
print beta;

/* Option 2: delete columns of xpx that are exactly zero.
Set coeffs of deleted columns to missing */
beta = j(nrow(xpx), 1, .);  /* or you can initialize beta to zero */
dropVars = loc( xpx[##,]=0 );
keepVars = setdif(1:ncol(xpx), dropVars);
beta[keepVars] = solve(xpx[keepVars,keepVars], xpy[keepVars]);
print beta;

Brilliant !!

Thanks so much,

Orit

From The DO Loop