I am using an iterative method. At each step, I pass in a block of x values together with yp digit labels and optimize a weight matrix wb. Then I pass in a new block of x and yp, optimize wb again, and so on. For 20 principal components (plus 1 constant), 10 digit labels, and a block size of 100, the matrix math inside each optimization step looks like
y (100 rows, 10 columns) = softmax(x (100 rows, 20+1 columns) * wb (20+1 rows, 10 columns)), with the softmax applied row-wise.
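For concreteness, one block step might look roughly like this. It's a minimal sketch in NumPy/SciPy; the function names (`fit_block`, `neg_log_likelihood`) and the choice of L-BFGS-B are my illustrative stand-ins, not necessarily the actual solver being used:

```python
import numpy as np
from scipy.optimize import minimize

n_features = 20 + 1   # 20 principal components plus a constant column
n_classes = 10        # digit labels 0-9

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def neg_log_likelihood(wb_flat, x, y_onehot):
    wb = wb_flat.reshape(n_features, n_classes)
    p = softmax(x @ wb)                    # (block_size, 10) predicted probabilities
    return -np.sum(y_onehot * np.log(p + 1e-12))

def fit_block(x, y_onehot, wb_init):
    # Optimize wb on a single block, starting from wb_init.
    res = minimize(neg_log_likelihood, wb_init.ravel(),
                   args=(x, y_onehot), method="L-BFGS-B")
    return res.x.reshape(n_features, n_classes)
```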
I think your suggestion in (3) makes sense. I would expect the wb matrix to be stable from block to block, since it's supposed to generalize to out-of-sample data. That's why it surprises me that starting from the previous iteration's wb is so much slower than starting from a constant.
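The loop I have in mind is the following (continuing the sketch above; `blocks` is a hypothetical iterable of (x, y_onehot) pairs):

```python
wb = np.zeros((n_features, n_classes))     # constant start for the first block
for x_blk, y_blk in blocks:
    # Warm start: initialize from the previous block's solution.
    wb = fit_block(x_blk, y_blk, wb)
    # Cold start, for comparison: initialize from a constant every time.
    wb_cold = fit_block(x_blk, y_blk, np.zeros((n_features, n_classes)))
```

If wb really is stable across blocks, the warm-started call should converge in fewer iterations than the cold-started one, which is why the opposite behavior is puzzling.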