As many people here work with optimization, e.g. for training neural networks in data compression, I would like to propose a general thread about such methods.
For example, 1st order methods like ADAM currently dominate, yet additionally modeling a parabola along the descent direction would be valuable, e.g. for estimating the step size (a minimal sketch of this idea follows below).
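For concreteness, here is one way such a parabola model could look: probe the loss at a few points along the descent direction, fit a parabola through them, and jump to its vertex. This is only a sketch; the `loss`/`grad` oracles, the probe spacing `eps`, and the fallback for a downward-opening fit are illustrative assumptions, not a specific published method.

```python
import numpy as np

def parabola_step(loss, grad, theta, eps=1e-2):
    """Estimate a step size by fitting a parabola to the loss
    along the (negative) gradient direction.

    loss/grad are assumed oracles; eps and the fallback are
    illustrative choices for this sketch.
    """
    g = grad(theta)
    d = -g / (np.linalg.norm(g) + 1e-12)   # unit descent direction
    # sample the loss at three points along d: t = 0, eps, 2*eps
    f0 = loss(theta)
    f1 = loss(theta + eps * d)
    f2 = loss(theta + 2 * eps * d)
    # fit f(t) ~ a*t^2 + b*t + c through the three samples
    a = (f2 - 2 * f1 + f0) / (2 * eps ** 2)
    b = (f1 - f0) / eps - a * eps
    if a <= 0:                        # parabola opens downward: no interior minimum,
        return theta + 2 * eps * d    # fall back to the largest probed step
    t_min = -b / (2 * a)              # vertex of the fitted parabola
    return theta + t_min * d

# toy usage: quadratic bowl with minimum at the origin
loss = lambda th: 0.5 * th @ th
grad = lambda th: th
theta = np.array([3.0, -2.0])
theta = parabola_step(loss, grad, theta)  # jumps close to the optimum in one step
```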
However, 2nd order methods face many difficulties due to the huge dimension and noisy data, e.g.:
- the Hessian has dim^2 entries, so we need to restrict to a subspace - how should it be chosen?
- how to estimate the Hessian from noisy data, and how to invert such a noisy Hessian? (the first sketch after this list probes both questions)
- the naive Newton method is attracted to the nearest gradient = 0 point ... which is usually a saddle - how can this be repaired? Many methods approximate the Hessian as positive definite (e.g. Gauss-Newton, the Fisher matrix in K-FAC, TONGA), pretending that a very nonconvex function is locally convex ... (the second sketch after this list shows one repair)
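One way to attack the first two points at once is to estimate the Hessian only in a small subspace (e.g. spanned by a few recent gradients) via finite differences of a noisy gradient oracle, averaging several evaluations per probe to damp the noise. A minimal sketch; `grad`, the averaging scheme, and all parameter values are assumptions for illustration:

```python
import numpy as np

def subspace_hessian(grad, theta, directions, eps=1e-3, n_samples=4):
    """Estimate the Hessian restricted to span(directions) by finite
    differences of a noisy gradient oracle, averaging n_samples
    gradient evaluations per probe to reduce the noise.
    """
    # orthonormalize the chosen directions (e.g. recent gradients)
    Q, _ = np.linalg.qr(np.column_stack(directions))
    k = Q.shape[1]
    H = np.zeros((k, k))

    def avg_grad(x):  # average several noisy (e.g. minibatch) gradients
        return np.mean([grad(x) for _ in range(n_samples)], axis=0)

    g0 = avg_grad(theta)
    for i in range(k):
        gi = avg_grad(theta + eps * Q[:, i])
        H[:, i] = Q.T @ (gi - g0) / eps  # directional derivative of the gradient
    H = 0.5 * (H + H.T)                  # symmetrize the noisy estimate
    return Q, H
```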
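For the saddle problem, one known repair is the saddle-free Newton idea (Dauphin et al. 2014): when inverting the Hessian, replace each eigenvalue lambda by |lambda|, so that saddles repel the iteration instead of attracting it, and add damping so the noisy inverse stays well conditioned. A sketch building on the subspace estimate above; the damping value is an illustrative assumption:

```python
import numpy as np

def saddle_free_step(g, Q, H, damping=1e-3):
    """Newton-like step in the subspace that repels saddles:
    invert |H| + damping*I instead of H itself."""
    lam, V = np.linalg.eigh(H)           # H is small (k x k), so this is cheap
    inv = 1.0 / (np.abs(lam) + damping)  # |lambda| turns saddle attraction into repulsion
    g_sub = Q.T @ g                      # project the gradient onto the subspace
    step_sub = V @ (inv * (V.T @ g_sub)) # (|H| + damping*I)^{-1} g in the subspace
    return -Q @ step_sub                 # map the step back to the full space
```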
A popular overview of 1st order methods: http://ruder.io/optimizing-gradient-descent/
An overview of both 1st and 2nd order methods: https://www.dropbox.com/s/54v8cwqyp7uvddk/SGD.pdf