12-13-2016 04:46 PM
I'd like to implement TensorFlow's MNIST for ML Beginners in SAS. https://www.tensorflow.org/tutorials/mnist/beginners/ It classifies handwritten digits based on pixel intensities.
Their model is y = softmax(Wx + b), where
y = output vector of digit probabilities
W = weight matrix
x = pixel intensities
b = bias vector
They optimize W and b by minimizing cross-entropy, H = -Σy'log(y).
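Written out per component, the model and loss look like the block below. This assumes z = Wx + b denotes the evidence, y' is the one-hot true label, and there are 10 digit classes; the symbols z, j, and k are introduced here only for illustration.

```latex
\begin{aligned}
z &= Wx + b, \\
y_j &= \operatorname{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{10} e^{z_k}}, \\
\log y_j &= z_j - \log \sum_{k=1}^{10} e^{z_k}, \\
H &= -\sum_{j=1}^{10} y'_j \log y_j .
\end{aligned}
```

The third line is convenient in code because the log-probabilities come straight from the evidence, without forming y first and then taking its log.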
I've got some working SAS code, but it's not fast enough to do what I want yet. Using 10 principal components (instead of all the pixel values), and 10 training batches of 100 observations each, my code runs in about 2 minutes. I'd like to increase each of those parameters by a factor of 10.
Here's the part of my code implementing their model:
wbv = j(1,10*&pc+10,0.1);            /* 10 weights per pc + 10 bias, vector format */
start ml(wbv) global(x,yp);          /* x is pc scores, yp is 1-hot digit labels */
   wb = shape(wbv,&pc+1,10);         /* weight and bias, matrix format */
   logy = x*wb-log(exp(x*wb)[,+]);   /* y is softmax of evidence */
   ce = (-yp#logy)[:,+];             /* mean cross-entropy */
   return ce;
finish;
I'm using call nlpnra(rc,xrv,'ml',wbv) as the optimization routine.
Is there an easy way to get a big speedup?
12-13-2016 08:54 PM
What version of SAS/IML are you running? Which operating system?
The code you've posted is not very time intensive. It can be marginally improved, but not by 10-fold. Some ideas you can try:
1. Compute xwb = x*wb one time and use it in the formula for logy.
2. Shape the wbv parameter vector once outside the function and pass it in, instead of reshaping it every call of the optimization.
3. Provide better initial guesses for the optimization routine. In particular, using all zeros or all ones is a bad idea. If you are doing an iterative method, use estimates from the previous iteration to seed the next iteration. See steps 3 and 4 of "Ten tips before you run an optimization."
4. If your matrices are large, make sure your RAM is large enough to hold multiple copies of these large matrices.
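As a rough illustration of tip 1, here is a minimal sketch of the objective module with the product x*wb computed once. The toy x and yp matrices and the value of the macro variable pc are made up for the example; only the module body follows the code posted above.

```sas
%let pc = 2;                         /* hypothetical: number of principal components */
proc iml;
start ml(wbv) global(x, yp);
   wb   = shape(wbv, &pc+1, 10);     /* weight and bias, matrix format */
   xwb  = x*wb;                      /* evidence, computed once and reused */
   logy = xwb - log(exp(xwb)[,+]);   /* log of softmax, row by row */
   ce   = (-yp#logy)[:,+];           /* mean cross-entropy */
   return( ce );
finish;

/* toy data (assumed): 3 observations, 2 scores plus a constant column */
x  = {0.5 -1.2 1,
     -0.3  0.8 1,
      1.1  0.4 1};
yp = j(3, 10, 0);                    /* 1-hot labels for digits 0, 3, 9 */
yp[1,1] = 1;  yp[2,4] = 1;  yp[3,10] = 1;

wbv = j(1, 10*&pc+10, 0.1);
ce  = ml(wbv);
print ce;   /* equal weights give a uniform softmax, so ce = log(10), about 2.3026 */
quit;
```

This saves one matrix multiplication per function call, which is why the gain is likely to be marginal rather than ten-fold.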
I recommend that you use the TIME function to profile the various steps of the algorithm to find out where it is spending most of its time. The I/O? The eigenvalue computation? Depending on what you find, see if you can optimize the bottleneck. The general code snippet is
t0 = time();
/* computation here */
tElapsed = time()-t0;
See also the general principles in the first three tips in the article "Eight tips to make your simulation run faster."
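For instance, the timing pattern can be wrapped around each stage separately. The sketch below is only an illustration: the random 1000 x 50 matrix stands in for the real pixel data, and the two stages (a cross-product and an eigenvalue decomposition) are assumed stand-ins for whatever steps your program actually performs.

```sas
proc iml;
call randseed(1);
x = j(1000, 50);                     /* placeholder data, not the MNIST pixels */
call randgen(x, "Normal");

t0 = time();
c = x`*x;                            /* stage 1: cross-product matrix */
tCross = time() - t0;

t0 = time();
call eigen(evals, evecs, c);         /* stage 2: eigenvalue decomposition */
tEigen = time() - t0;

print tCross tEigen;                 /* elapsed seconds per stage */
quit;
```

Whichever stage dominates the elapsed time is the one worth optimizing first.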
12-14-2016 11:39 AM
Thanks for the reply. I'm running SAS/IML 13.2 on Windows 7.
Regarding 3., I must be doing something wrong. Here's my do loop.
wbv = j(1,&pg*11,0.1);
do i = 1 to &nb;
   x[,1:&pc] = train[rows[i,],2:&pc+1];
   yp[,unique(train[rows[i,],1])+1] = design(train[rows[i,],1]);
   call nlpnra(rc,xrv,'ml',wbv);
   wbv = xrv;
   xr[i,] = xrv;
end;
Commenting out the line wbv = xrv, so that wbv always starts as a constant matrix of 0.1 instead of starting at the last solution, speeds up the code instead of slowing it down. Why would that be?
12-14-2016 02:42 PM - edited 12-14-2016 02:43 PM
Regarding (3), I said "If you are doing an iterative method, use estimates from the previous iteration to seed the next iteration." It is hard to tell from your code, but it looks like you might be looking at new data each time through the loop. If so, you might be using parameters from a bitmap that shows a "1" as an initial guess for a bitmap that shows an "8." That might explain the decrease in performance.
My point is to be smart about the initial guess. Often you can come up with heuristics that work better than a constant matrix.
Sorry about the wrong suggestion for (2). I think you are right: the argument should be a vector.
12-14-2016 03:54 PM - edited 12-14-2016 04:00 PM
I am doing an iterative method. At each step, I'm passing in a block of x values together with yp digit labels, and optimizing a weight matrix wb. Then I'm passing a new block of x and yp, and again optimizing wb, and so on. For 20 principal components (plus 1 constant), 10 digit labels, and a block size of 100, the matrix math inside each optimization step looks like
y (100 rows, 10 columns) = softmax of x (100 rows, 20+1 columns) * wb (20+1 rows, 10 columns).
I think your suggestion in (3) makes sense. I would expect the wb matrix to be stable from block to block. It's supposed to generalize to out-of-sample data. That's why it surprises me that starting from the previous iteration is so much slower than starting from a constant.
12-15-2016 07:47 AM
You could do the following experiment: Let
diff = norm( x0 - xf )
be the vector norm of the difference between the initial guess (x0) and the optimal parameter values (xf).
A plot of diff vs iteration number will tell you how close the initial guess was to the final guess for each step in the iteration.
If the optimal solution at step (i+1) is close to the optimal solution at step i, you should see small differences. You can do the experiment twice: once using a constant initial guess and once using the optimal solution at the previous iteration.
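A minimal sketch of that experiment follows. To keep it short and self-contained, a toy quadratic objective (the hypothetical module obj, with a made-up target that shifts slightly from batch to batch) stands in for the real 'ml' module; the names p0const, p0warm, diffConst, and diffWarm are likewise invented for the example.

```sas
proc iml;
/* toy objective (assumed): squared distance to a batch-specific target */
start obj(p) global(target);
   d = p - target;                   /* both are 1 x 2 row vectors */
   return( ssq(d) );
finish;

call randseed(123);
nb = 5;                              /* number of batches */
diffConst = j(nb, 1, .);             /* distances when starting from a constant */
diffWarm  = j(nb, 1, .);             /* distances when starting from last solution */
p0const = {0.1 0.1};
p0warm  = p0const;
opt = {0 0};                         /* minimize; suppress printed output */

do i = 1 to nb;
   e = j(1, 2);
   call randgen(e, "Normal");
   target = {1 2} + 0.05*e;          /* the optimum moves a little each batch */

   call nlpnra(rc, xf, "obj", p0const, opt);
   diffConst[i] = norm(p0const - xf);

   call nlpnra(rc, xf, "obj", p0warm, opt);
   diffWarm[i] = norm(p0warm - xf);
   p0warm = xf;                      /* seed the next batch with this solution */
end;

print diffConst diffWarm;
quit;
```

In this toy setup the optimum barely moves between batches, so the warm-start distances after the first batch should be much smaller than the constant-start distances. If the same comparison on the real data shows large warm-start distances, the per-batch optima are not as stable as expected.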
12-14-2016 11:42 AM
Thanks for your reply.
I can't figure out how to use NLPQN. When I submit
I get the following errors.
ERROR: QUANEW Optimization cannot be completed.
ERROR: The function value of the objective function cannot be computed during the optimization process.
ERROR: Execution error as noted previously. (rc=100)