I am using an iterative method.  At each step, I pass in a block of x values together with their yp digit labels, and optimize a weight matrix wb.  Then I pass in a new block of x and yp, optimize wb again, and so on.  For 20 principal components (plus 1 constant), 10 digit labels, and a block size of 100, the matrix math inside each optimization step looks like 
   
 y (100 rows, 10 columns) = softmax( x (100 rows, 20+1 columns) * wb (20+1 rows, 10 columns) ) 
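
For concreteness, the shapes above can be checked with a small NumPy sketch. This is only an illustration, not the actual code being run: the random data stands in for the real principal components, the row-wise softmax helper and the names pcs, n_components, n_labels and block_size are assumptions, and the constant is assumed to be a column of ones appended to x.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax; subtracting the row max avoids overflow in exp.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_components, n_labels, block_size = 20, 10, 100

# Stand-in data: random values in place of the real principal components.
pcs = rng.normal(size=(block_size, n_components))

# Append the constant column, giving x the shape (100, 20+1).
x = np.hstack([pcs, np.ones((block_size, 1))])

# Weight matrix wb with shape (20+1, 10), started here from a constant (zeros).
wb = np.zeros((n_components + 1, n_labels))

# y has shape (100, 10); each row is a probability distribution over the digits.
y = softmax(x @ wb)
print(x.shape, wb.shape, y.shape)
```

With wb at zero, every row of y is uniform over the 10 labels, which is the "constant" starting point referred to in this thread.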
   
 I think your suggestion in (3) makes sense.  I would expect the wb matrix to be stable from block to block. It's supposed to generalize to out-of-sample data.  That's why it surprises me that starting from the previous iteration is so much slower than starting from a constant. 
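
To make the comparison explicit, here is a rough sketch of the two starting strategies side by side. It rests on several assumptions that are not from my actual setup: plain gradient descent on the mean cross-entropy stands in for whatever optimizer is really used, the blocks are synthetic, and fit_block, run, lr and steps are made-up names. It only shows where the starting point enters; it does not reproduce the slowdown.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(x, yp, wb):
    # Mean cross-entropy between one-hot labels yp and softmax(x @ wb).
    y = softmax(x @ wb)
    return -np.mean(np.sum(yp * np.log(y + 1e-12), axis=1))

def fit_block(x, yp, wb0, lr=0.1, steps=50):
    # Assumed optimizer: fixed-step gradient descent starting from wb0.
    wb = wb0.copy()
    for _ in range(steps):
        y = softmax(x @ wb)
        grad = x.T @ (y - yp) / x.shape[0]
        wb -= lr * grad
    return wb

def run(blocks, warm_start, n_features=21, n_labels=10):
    wb = np.zeros((n_features, n_labels))
    losses = []
    for x, yp in blocks:
        # Warm start reuses the previous block's wb; otherwise restart from zeros.
        wb0 = wb if warm_start else np.zeros((n_features, n_labels))
        wb = fit_block(x, yp, wb0)
        losses.append(cross_entropy(x, yp, wb))
    return wb, losses

# Synthetic blocks: 100 rows, 20 features plus a constant column, 10 labels.
rng = np.random.default_rng(0)
true_wb = rng.normal(size=(21, 10))
blocks = []
for _ in range(5):
    x = np.hstack([rng.normal(size=(100, 20)), np.ones((100, 1))])
    yp = np.eye(10)[np.argmax(x @ true_wb, axis=1)]
    blocks.append((x, yp))

wb_warm, loss_warm = run(blocks, warm_start=True)
wb_cold, loss_cold = run(blocks, warm_start=False)
print(loss_warm)
print(loss_cold)
```

Timing each call to fit_block in both runs, or counting iterations until a convergence criterion is met, would be one way to see whether the warm start is slower per block or only slower in total.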
						
					