<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: TensorFlow MNIST in SAS/IML Software and Matrix Computations</title>
    <link>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/318991#M3139</link>
    <description>&lt;P&gt;Thanks for your reply.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I can't figure out how to use NLPQN.&amp;nbsp; When I submit&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#0000ff" face="Courier New" size="2"&gt;call&lt;/FONT&gt;&lt;FONT face="Courier New" size="2"&gt; nlpqn(rc,xrv,&lt;/FONT&gt;&lt;FONT color="#800080" face="Courier New" size="2"&gt;'ml'&lt;/FONT&gt;&lt;FONT face="Courier New" size="2"&gt;,wbv);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;I get the following errors.&lt;/P&gt;
&lt;P&gt;ERROR: QUANEW Optimization cannot be completed.&lt;/P&gt;
&lt;P&gt;ERROR: The function value of the objective function cannot be computed during the optimization process.&lt;/P&gt;
&lt;P&gt;ERROR: Execution error as noted previously. (rc=100)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Wed, 14 Dec 2016 16:42:55 GMT</pubDate>
    <dc:creator>mcs</dc:creator>
    <dc:date>2016-12-14T16:42:55Z</dc:date>
    <item>
      <title>TensorFlow MNIST</title>
      <link>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/318708#M3130</link>
      <description>&lt;P&gt;I'd like to&amp;nbsp;implement TensorFlow's&amp;nbsp;MNIST for ML Beginners&amp;nbsp; in SAS.&amp;nbsp; &lt;A href="https://www.tensorflow.org/tutorials/mnist/beginners/" target="_blank"&gt;https://www.tensorflow.org/tutorials/mnist/beginners/&lt;/A&gt;&amp;nbsp; It classifies handwritten digits based on pixel intensities.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Their model is y = softmax(Wx + b), where&lt;/P&gt;
&lt;P&gt;&amp;nbsp; y = output vector of digit probabilities&lt;/P&gt;
&lt;P&gt;&amp;nbsp; W = weight matrix&lt;/P&gt;
&lt;P&gt;&amp;nbsp; x = pixel intensities&lt;/P&gt;
&lt;P&gt;&amp;nbsp; b = bias vector&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;They optimize W and b&amp;nbsp;by minimizing&amp;nbsp;the cross-entropy, H = -Σy'log(y), where y' is the true (one-hot) label distribution.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I've got some working SAS code, but it's not fast enough to do what I want yet.&amp;nbsp; Using 10 principal components (instead of all the pixel values), and 10 training batches of 100 observations each, my code runs in about 2 minutes.&amp;nbsp; I'd like to increase each of those parameters by a factor of 10.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here's the part of my code implementing their model.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;wbv = j(1,10*&amp;amp;pc+10,0.1);			/* 10 weights per pc + 10 bias, vector format */
start ml(wbv) global(x,yp);			/* x is pc scores, yp is 1-hot digit labels */
	wb = shape(wbv,&amp;amp;pc+1,10);		/* weight and bias, matrix format */
	logy = x*wb-log(exp(x*wb)[,+]); 	/* y is softmax of evidence */
	ce = (-yp#logy)[:,+];			/* mean cross-entropy */
	return ce;
finish;&lt;/CODE&gt;&lt;/PRE&gt;
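&lt;P&gt;For comparison, the objective above can be sketched in NumPy (a rough translation with made-up inputs, not the poster's code; here x carries an intercept column and yp holds one-hot labels):&lt;/P&gt;

```python
# Hypothetical NumPy translation of the IML 'ml' objective (a sketch):
# mean softmax cross-entropy for y = softmax(x*wb).
import numpy as np

def ml(wbv, x, yp, pc=10):
    """wbv: flat weight/bias vector; x: scores plus intercept; yp: one-hot labels."""
    wb = wbv.reshape(pc + 1, 10)        # weights and biases in matrix form
    z = x @ wb                          # evidence
    logy = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log softmax
    return float((-yp * logy).sum(axis=1).mean())            # mean cross-entropy

# tiny smoke test with random data
rng = np.random.default_rng(0)
x = np.hstack([rng.normal(size=(5, 10)), np.ones((5, 1))])
yp = np.eye(10)[rng.integers(0, 10, size=5)]
ce = ml(np.full(11 * 10, 0.1), x, yp)
```

&lt;P&gt;With a constant starting vector, every class gets identical evidence, so the softmax is uniform and the cross-entropy is exactly log(10).&lt;/P&gt;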
&lt;P&gt;I'm using&amp;nbsp;&lt;FONT color="#0000ff" face="Courier New" size="2"&gt;call&lt;/FONT&gt;&lt;FONT face="Courier New" size="2"&gt; nlpnra(rc,xrv,&lt;/FONT&gt;&lt;FONT color="#800080" face="Courier New" size="2"&gt;'ml'&lt;/FONT&gt;&lt;FONT face="Courier New" size="2"&gt;,wbv) &lt;/FONT&gt;as the optimization routine.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is there an easy way to get a big speedup?&lt;/P&gt;</description>
      <pubDate>Tue, 13 Dec 2016 21:46:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/318708#M3130</guid>
      <dc:creator>mcs</dc:creator>
      <dc:date>2016-12-13T21:46:58Z</dc:date>
    </item>
    <item>
      <title>Re: TensorFlow MNIST</title>
      <link>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/318739#M3131</link>
      <description>&lt;P&gt;What version of SAS/IML are you running? Which operating system?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The code you've posted is not very time intensive. It can be marginally improved, but not by 10-fold. &amp;nbsp;Some ideas you can try:&lt;/P&gt;
&lt;P&gt;1. Compute xwb = x*wb one time and use it in the formula for logy.&lt;/P&gt;
&lt;P&gt;2. Shape the wbv parameter vector once outside the function and pass it in, instead of reshaping it every call of the optimization.&lt;/P&gt;
&lt;P&gt;3. Provide better initial guesses for the optimization routine. &amp;nbsp;In particular, using all zeros or all ones is a bad idea. If you are doing an iterative method, use estimates from the previous iteration to seed the next iteration. See steps 3 and 4 of &lt;A href="http://blogs.sas.com/content/iml/2016/05/11/tips-before-optimization.html" target="_self"&gt;"Ten tips before you run an optimization."&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;4. If your matrices are large,&lt;A href="http://blogs.sas.com/content/iml/2015/07/31/large-matrices.html" target="_self"&gt; make sure your RAM is large enough&lt;/A&gt; to hold multiple copies of these large matrices.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I recommend that you use the TIME function to profile the various steps of the algorithm to find out where it is spending most of the time. The I/O? The eigenvalue computation? &amp;nbsp;Depending on what you find, see if you can optimize the bottleneck. &amp;nbsp;The general code snippet is&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;t0 = time();
/* computation here */
tElapsed = time()-t0;&lt;/CODE&gt;&lt;/PRE&gt;
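&lt;P&gt;The same timing pattern, sketched in Python for anyone profiling a prototype outside SAS (time.perf_counter is the usual high-resolution timer there):&lt;/P&gt;

```python
# Wall-clock profiling of a single step, mirroring the t0/tElapsed pattern.
import time

t0 = time.perf_counter()
total = sum(i * i for i in range(100000))   # stand-in for the real computation
tElapsed = time.perf_counter() - t0         # seconds spent in this step
```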
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;See also the general principles in the first three tips in the article &lt;A href="http://blogs.sas.com/content/iml/2012/06/06/tips-to-make-your-simulation-run-faster.html" target="_self"&gt;"Eight tips to make your simulation run faster."&lt;/A&gt;&lt;/P&gt;
      <pubDate>Wed, 14 Dec 2016 01:54:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/318739#M3131</guid>
      <dc:creator>Rick_SAS</dc:creator>
      <dc:date>2016-12-14T01:54:37Z</dc:date>
    </item>
    <item>
      <title>Re: TensorFlow MNIST</title>
      <link>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/318746#M3132</link>
      <description>&lt;P&gt;NLPQN (Dual) Quasi-Newton Method&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;could get you a little faster.&lt;/P&gt;</description>
      <pubDate>Wed, 14 Dec 2016 03:33:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/318746#M3132</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2016-12-14T03:33:56Z</dc:date>
    </item>
    <item>
      <title>Re: TensorFlow MNIST</title>
      <link>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/318990#M3138</link>
      <description>&lt;P&gt;Thanks for the reply.&amp;nbsp; I'm running SAS/IML 13.2 on Windows 7.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;OK, I did this.&amp;nbsp; It shaved&amp;nbsp;20 seconds or so off the two-minute run time.&lt;/LI&gt;
&lt;LI&gt;The wbv parameter vector is the thing being optimized by the NLPNRA call, and I think it has to be passed&amp;nbsp;from nlpnra to ml&amp;nbsp;as a vector.&amp;nbsp; When I try to pass it as a matrix, I get ERROR: NLPNRA call: Error in argument X0.&lt;/LI&gt;
&lt;LI&gt;OK, I did this.&amp;nbsp; Surprisingly, it increased the run time by more than a minute.&lt;/LI&gt;
&lt;LI&gt;My matrices aren't that big.&amp;nbsp;&amp;nbsp;My batch size (rows of x) and number of principal components (columns of x) will both be under 1000.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Regarding 3., I must be doing something wrong.&amp;nbsp; Here's my do loop.&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Courier New" size="2"&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;wbv = j(1,&amp;amp;pg*11,0.1);
do i = 1 to &amp;amp;nb;
	x[,1:&amp;amp;pc] = train[rows[i,],2:&amp;amp;pc+1];
	yp[,unique(train[rows[i,],1])+1] = design(train[rows[i,],1]);
	call nlpnra(rc,xrv,'ml',wbv);
	wbv = xrv;
	xr[i,] = xrv;
end;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Commenting out the line wbv = xrv, so that wbv always starts as a constant matrix of 0.1 instead of starting at the last solution, speeds up the code instead of slowing it down.&amp;nbsp; Why would that be?&lt;/P&gt;</description>
      <pubDate>Wed, 14 Dec 2016 16:39:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/318990#M3138</guid>
      <dc:creator>mcs</dc:creator>
      <dc:date>2016-12-14T16:39:53Z</dc:date>
    </item>
    <item>
      <title>Re: TensorFlow MNIST</title>
      <link>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/318991#M3139</link>
      <description>&lt;P&gt;Thanks for your reply.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I can't figure out how to use NLPQN.&amp;nbsp; When I submit&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#0000ff" face="Courier New" size="2"&gt;call&lt;/FONT&gt;&lt;FONT face="Courier New" size="2"&gt; nlpqn(rc,xrv,&lt;/FONT&gt;&lt;FONT color="#800080" face="Courier New" size="2"&gt;'ml'&lt;/FONT&gt;&lt;FONT face="Courier New" size="2"&gt;,wbv);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;I get the following errors.&lt;/P&gt;
&lt;P&gt;ERROR: QUANEW Optimization cannot be completed.&lt;/P&gt;
&lt;P&gt;ERROR: The function value of the objective function cannot be computed during the optimization process.&lt;/P&gt;
&lt;P&gt;ERROR: Execution error as noted previously. (rc=100)&lt;/P&gt;
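&lt;P&gt;One common cause of "function value cannot be computed" in this kind of objective is overflow in exp(x*wb) when the line search tries large weights; subtracting the row maximum before exponentiating keeps everything finite. A NumPy sketch of the log-sum-exp trick (an illustration, not a confirmed diagnosis of this run):&lt;/P&gt;

```python
# Log-sum-exp trick: compute log softmax without overflowing exp().
import numpy as np

def log_softmax(z):
    # subtract the row max first so exp() never sees a huge argument
    m = z.max(axis=1, keepdims=True)
    return z - m - np.log(np.exp(z - m).sum(axis=1, keepdims=True))

z = np.array([[1000.0, 0.0]])    # naive exp(1000) would overflow to inf
stable = log_softmax(z)          # finite log probabilities
```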
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 14 Dec 2016 16:42:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/318991#M3139</guid>
      <dc:creator>mcs</dc:creator>
      <dc:date>2016-12-14T16:42:55Z</dc:date>
    </item>
    <item>
      <title>Re: TensorFlow MNIST</title>
      <link>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/319031#M3141</link>
      <description>&lt;P&gt;Regarding (3), I said "&lt;SPAN&gt;&lt;EM&gt;If you are doing an iterative method&lt;/EM&gt;, use estimates from the previous iteration to seed the next iteration." It is hard to tell from your code, but it looks like you might be looking new data each time in the loop? &amp;nbsp;If so, you might be using parameters from a bitmap that shows a "1" as an initial guess for a bitmap that shows an "8." That might explain the decrease in performance.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;My point is to be smart about the &amp;nbsp;initial guess. &amp;nbsp;Often you can come up with heuristics that work better than a&amp;nbsp;constant matrix.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Sorry about the wrong suggestion for (2). &amp;nbsp;I think you are right: the argument should be a vector.&lt;/SPAN&gt;&lt;/P&gt;
      <pubDate>Wed, 14 Dec 2016 19:43:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/319031#M3141</guid>
      <dc:creator>Rick_SAS</dc:creator>
      <dc:date>2016-12-14T19:43:39Z</dc:date>
    </item>
    <item>
      <title>Re: TensorFlow MNIST</title>
      <link>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/319048#M3150</link>
      <description>&lt;P&gt;I am doing an interative method.&amp;nbsp; At each step, I'm passing in a block of x values together with yp digit labels, and optimizing a weight matrix wb.&amp;nbsp; Then I'm passing a new block of x and yp, and again optimizing wb, and so on.&amp;nbsp; For 20 principal components (plus 1&amp;nbsp;constant), 10 digit labes,&amp;nbsp;and 100 block size, the matrix math&amp;nbsp;inside each optimization step&amp;nbsp;looks like&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;y (100 rows, 10 columns) = softmax of x (100 rows, 20+1 columns) * wb (20+1 rows, 10 columns).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I think your suggestion in (3) makes sense.&amp;nbsp; I would expect the wb matrix to be stable from block to block. It's supposed to generalize to out-of-sample data.&amp;nbsp; That's why it surprises me that starting from the previous iteration is so much slower than starting from a constant.&lt;/P&gt;</description>
      <pubDate>Wed, 14 Dec 2016 21:00:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/319048#M3150</guid>
      <dc:creator>mcs</dc:creator>
      <dc:date>2016-12-14T21:00:09Z</dc:date>
    </item>
    <item>
      <title>Re: TensorFlow MNIST</title>
      <link>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/319241#M3151</link>
      <description>&lt;P&gt;You could do the following experiment: Let&lt;/P&gt;
&lt;P&gt;diff = norm(x0 - xf)&lt;/P&gt;
&lt;P&gt;be the vector norm of the difference between the initial guess (x0) and the optimal&amp;nbsp;parameter values (xf).&lt;/P&gt;
&lt;P&gt;A plot of diff vs iteration number will tell you how close the initial guess was to the final guess for each step in the iteration.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
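&lt;P&gt;The bookkeeping for that experiment, sketched in NumPy with a toy quadratic objective and plain gradient descent standing in for NLPNRA (all names here are hypothetical):&lt;/P&gt;

```python
# Warm-start experiment sketch: record norm(x0 - xf) at each iteration,
# seeding each optimization with the previous solution.
import numpy as np

def optimize(x0, target, steps=200, lr=0.1):
    """Toy stand-in for NLPNRA: gradient descent on ||x - target||^2."""
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * 2.0 * (x - target)   # gradient step
    return x

rng = np.random.default_rng(1)
targets = rng.normal(size=(5, 3))         # one "batch optimum" per iteration
x0 = np.full(3, 0.1)                      # constant initial guess to start
diffs = []
for t in targets:
    xf = optimize(x0, t)
    diffs.append(np.linalg.norm(x0 - xf)) # distance from guess to optimum
    x0 = xf                               # warm start for the next batch
```

&lt;P&gt;Plotting diffs against the iteration number then shows whether the warm start is actually landing near each batch's optimum.&lt;/P&gt;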
&lt;P&gt;If&amp;nbsp;the optimal solution at step (i+1) is close to the optimal solution at step i, you should see small differences. &amp;nbsp; &amp;nbsp;You can do the experiment twice: once using a constant initial guess and once using the optimal solution at the previous iteration.&lt;/P&gt;</description>
      <pubDate>Thu, 15 Dec 2016 12:47:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/TensorFlow-MNIST/m-p/319241#M3151</guid>
      <dc:creator>Rick_SAS</dc:creator>
      <dc:date>2016-12-15T12:47:04Z</dc:date>
    </item>
  </channel>
</rss>

