
Thread: Advanced optimization methods including SGD and 2nd order?

  1. #1
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts

    Advanced optimization methods including SGD and 2nd order?

As many people here work with optimization, e.g. for training neural networks in data compression, I would like to propose a general thread about such methods.
For example, 1st order methods like ADAM currently dominate, while additionally modelling a parabola would be valuable, e.g. for estimating the step size.
However, there are many difficulties with 2nd order methods due to the huge dimension and noisy data, e.g.:
- the size of the Hessian is dim^2, so we need to restrict to a subspace - how to choose it?
- how to estimate the Hessian from noisy data, and how to invert such a noisy Hessian?
- the naive Newton method is attracted to the nearest gradient = 0 point ... which is usually a saddle - how to repair that (a toy example below illustrates the problem)? Many methods approximate the Hessian as positive definite (e.g. Gauss-Newton, the Fisher matrix in K-FAC, TONGA) - pretending that a very nonconvex function is locally convex ...
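A toy example of the saddle attraction (my own illustration, not from the linked papers): on f(x, y) = x^2 - y^2 a single Newton step jumps exactly onto the saddle at (0, 0), while plain gradient descent moves away from it along y.

Code:
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle at (0, 0)
def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])                 # constant Hessian of f

p = np.array([1.0, 0.1])
newton = p - np.linalg.solve(H, grad(p))    # -> [0. 0.]: lands on the saddle
gd     = p - 0.1 * grad(p)                  # -> [0.8 0.12]: escapes along y
print(newton, gd)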
    Popular overview of 1st order methods: http://ruder.io/optimizing-gradient-descent/
    Of 1st and 2nd order methods: https://www.dropbox.com/s/54v8cwqyp7uvddk/SGD.pdf
    Any thoughts?

  2. The Following User Says Thank You to Jarek For This Useful Post:

    Shelwien (8th June 2019)

  3. #2
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
I tried using a second order method (the extended Kalman filter) on the last layer of weights in PAQ8. I wrote about it in section 3.2.3 of my master's thesis: http://www.byronknoll.com/thesis.pdf
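For readers unfamiliar with the EKF, here is a rough sketch of what a single EKF weight update for one sigmoid output unit can look like - a generic textbook version, not necessarily the exact formulation from the thesis; the noise variances q and r are hypothetical constants.

Code:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ekf_update(w, P, x, bit, q=1e-6, r=0.1):
    # State = weight vector w with covariance P; inputs x; observed bit in {0, 1}.
    P = P + q * np.eye(len(w))          # predict: random-walk model of the weights
    p_hat = sigmoid(w @ x)              # predicted bit probability
    H = p_hat * (1.0 - p_hat) * x       # Jacobian of the observation wrt w
    S = H @ P @ H + r                   # innovation variance (scalar)
    K = (P @ H) / S                     # Kalman gain
    w = w + K * (bit - p_hat)           # correct weights toward the observed bit
    P = P - np.outer(K, H) @ P          # covariance update: (I - K H) P
    return w, P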

  4. The Following User Says Thank You to byronknoll For This Useful Post:

    Jarek (9th June 2019)

  5. #3
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
Thanks, interesting - so you assume that the parameters of the model evolve with Gaussian transitions. Why not just use the likelihood instead (as it directly corresponds to compression ratio)?
For example, in each step perform likelihood gradient ascent to adapt the parameters to the local situation.

  6. #4
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    41
    Thanks
    9
    Thanked 16 Times in 11 Posts
I have had success with various approximate 2nd order methods like CG-Descent, etc. (basically various forms of nonlinear conjugate gradient descent; http://users.clas.ufl.edu/hager/pape...cg_compare.pdf is one such algorithm; the authors also have a decent survey of various CG methods); L-BFGS and L-BFGS* are another option. These methods represent the Hessian implicitly (and assume it is p.d.). I have also used SR1 (where the Hessian is not necessarily p.d., but is symmetric), but only in small dimensions, so I am not sure it would translate well. CG-Descent has worked for me in dimensions on the order of 200K-500K parameters and was very good at minimizing the number of function evaluations.
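For illustration, a bare-bones nonlinear CG loop with the Polak-Ribiere+ beta and simple Armijo backtracking - a much simpler cousin of the Hager-Zhang CG-DESCENT from the linked paper, just to show the overall shape of such methods (all constants are arbitrary choices of mine).

Code:
import numpy as np

def cg_pr_plus(f, grad, x0, iters=200, c=1e-4):
    x = x0.copy()
    g = grad(x)
    d = -g
    for _ in range(iters):
        slope = g @ d
        if slope >= 0:                  # not a descent direction: restart
            d, slope = -g, -(g @ g)
        alpha, fx = 1.0, f(x)           # Armijo backtracking line search
        while f(x + alpha * d) > fx + c * alpha * slope and alpha > 1e-12:
            alpha *= 0.5
        x = x + alpha * d
        g_new = grad(x)
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))   # Polak-Ribiere+ beta
        d = -g_new + beta * d
        g = g_new
    return x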

(Edit: Just to clarify, SR1 is assumed to converge faster to the true Hessian than CG or L-BFGS; that is its only relevance here.)
    Last edited by Stefan Atev; 10th June 2019 at 20:16.

  7. The Following User Says Thank You to Stefan Atev For This Useful Post:

    Jarek (10th June 2019)

  8. #5
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
    Quote Originally Posted by Jarek View Post
Thanks, interesting - so you assume that the parameters of the model evolve with Gaussian transitions. Why not just use the likelihood instead (as it directly corresponds to compression ratio)?
For example, in each step perform likelihood gradient ascent to adapt the parameters to the local situation.
I'm not sure I can answer your question - my master's supervisor (Nando de Freitas) derived the equations and I am not very familiar with the EKF. From what I understand, the loss function *is* cross entropy, which directly corresponds to compression ratio. I think the Gaussian transition function is separate from the loss function.

  9. #6
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    41
    Thanks
    9
    Thanked 16 Times in 11 Posts
I think there is a connection between online optimization and recursive filters (EKF, the unscented filter, etc.) - http://probabilistic-numerics.org/as...Bes_Miguez.pdf ; unfortunately, digging through control theory textbooks is like having a root canal for me, so I cannot offer a better explanation.

(Edit: I just realized that the experiments in one of the first UKF papers actually involve a neural network: https://www.seas.harvard.edu/courses.../unscented.pdf ; so using the EKF and UKF for training neural networks is definitely an established idea.)

  10. #7
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    I have no experience with Kalman filters - will have to take a closer look.

Stefan, both quasi-Newton methods (L-BFGS, SR1, DFP, ...) and CG assume convexity - they are attracted to saddles, of which there are often exp(dim) many in machine learning - isn't that a problem?
CG seems unpopular in the machine learning literature (?), but L-BFGS is quite common.
However, besides the saddle attraction problem, L-BFGS tries to estimate the inverse Hessian from the ~10 most recent stochastic gradients, which are very noisy - that seems numerically very unstable, as is also suggested by its requirement of a line search (as for CG).

Here are some of my thoughts on how to get successful 2nd order methods ( https://arxiv.org/pdf/1901.11457 ); I would gladly discuss them:
- perform 2nd order modelling only in a low-dimensional (d << D) subspace - chosen e.g. by online PCA of recent gradients (which suggests where the action locally is). Additionally, we can still use the remaining (D-d) directions of the gradient to simultaneously perform e.g. gradient descent there essentially for free - combining methods of multiple orders (see the sketch after this list),
- control the signs of curvatures to repel from saddles instead of being attracted to them, which gives the significant improvements shown by saddle-free Newton: https://arxiv.org/pdf/1406.2572,
- the Hessian should be estimated online to better follow the local situation, preferably from the gradient sequence alone. This can be done e.g. by linear regression of gradients (as the Hessian is their linear trend), updating four exponential moving averages: of g, x, gx, xx. Sure, quasi-Newton also estimates from gradients, but only from a few - making it numerically unstable. Exponential moving averages and linear regression should be much better at extracting statistical trends from noisy gradients.
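A minimal sketch of the first two points, assuming the subspace basis is taken from an SVD of a buffer of recent gradients (a crude offline stand-in for online PCA) and that a d x d Hessian estimate H_sub in that basis is already available; eps and lr are arbitrary constants of mine.

Code:
import numpy as np

def gradient_subspace(recent_grads, d):
    # pick a d-dimensional orthonormal basis from recent (noisy) gradients
    G = np.stack(recent_grads)                # shape (k, D)
    _, _, Vt = np.linalg.svd(G, full_matrices=False)
    return Vt[:d].T                           # D x d, orthonormal columns

def saddle_free_subspace_step(g, V, H_sub, lr=0.01, eps=1e-6):
    # saddle-free Newton inside span(V): use |eigenvalues| so saddles repel
    g_sub = V.T @ g                           # gradient projected into the subspace
    lam, Q = np.linalg.eigh(H_sub)
    inv_abs = Q @ np.diag(1.0 / np.maximum(np.abs(lam), eps)) @ Q.T
    step_sub = V @ (inv_abs @ g_sub)
    g_rest = g - V @ g_sub                    # leftover (D-d)-dimensional gradient part
    return -step_sub - lr * g_rest            # 2nd order in subspace + plain GD outside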

  11. #8
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    41
    Thanks
    9
    Thanked 16 Times in 11 Posts
Yes, in a stochastic setting I have never seen anybody use CG. I personally think a good CG is on par with L-BFGS when using just a few vectors. I have used CG for non-convex problems, with the understanding that the local minimum you end up in is the one you get - I don't think any method can give you convergence guarantees on non-convex problems in general (except maybe in an "expected" sense). I think one of the papers I linked to basically links EKF methods with quasi-Newton; the UKF paper shows two ways of putting the network training problem into a state-space setting.

  12. #9
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    41
    Thanks
    9
    Thanked 16 Times in 11 Posts
    Quote Originally Posted by Jarek View Post

- the Hessian should be estimated online to better follow the local situation, preferably from the gradient sequence alone. This can be done e.g. by linear regression of gradients (as the Hessian is their linear trend), updating four exponential moving averages: of g, x, gx, xx. Sure, quasi-Newton also estimates from gradients, but only from a few - making it numerically unstable. Exponential moving averages and linear regression should be much better at extracting statistical trends from noisy gradients.
I am not quite sure whether you're proposing this for the limited (d << D) case, where estimating the full Hessian is possible? In the full D dimensions, you can only afford an implicit Hessian (like CG, UKF), a low-rank Hessian or inverse Hessian (limited-memory quasi-Newton), or a structured Hessian (if diagonal counts as "structure"; maybe for certain networks I can imagine a reasonable sparse Hessian). It seems that if you've already projected to a d-dimensional space, a lot of the noise will also be projected out, so maybe you don't need additional strategies to smooth the Hessian further.

I could find several references on "stochastic" quasi-Newton, but I think none of them maps directly to what you're trying to do - each was a specialized solution. The success of SGD and the importance of neural networks have not escaped notice in numerical optimization circles, but sometimes the language is hard to translate between fields.

In your opinion, what would be the benefit of "better" optimization methods - better generalization performance, or just better computational performance? For pure machine learning the former is ultimately more important than the latter, and I think the fact that more advanced optimization can sometimes lead to worse generalization performance has been a very frustrating realization for the deep network folks...

  13. #10
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
Generally, the landscape in ML problems is extremely complex. A standard "stop and estimate the Hessian" approach requires maintaining a single Hessian model for a long time.
In contrast, inexpensive online estimation, especially one based only on the stochastic gradient sequence, would allow better adaptation to the local situation.

In how many dimensions would a 2nd order model be the most practical?
- in a single direction (d=1): we would get a momentum-like method with additional online modelling of a parabola in this single direction - allowing for a smarter step size, e.g. half the distance to the minimum of the modelled parabola,
- the complete Hessian (d=D) is completely impractical for ML dimensions in the millions. Additionally, most of its eigenvalues are close to zero due to over-parametrization, which makes inverting the Hessian a huge problem,

- I think d~10 is a reasonable compromise: it adds a multidimensional model and simultaneous optimization on top of d=1 at comparable computational cost.
Choosing these directions by some online PCA of recent gradients, they point to where the action really is: the sparse directions with curvature far from zero.
So we could maintain (e.g. by online PCA) a nearly orthogonal basis (v_i) of size d~10, use projections of the gradient onto this basis to update a Hessian model there, and use what is left of the gradient (the (D-d)-dimensional projection) for simultaneous 1st order optimization like gradient descent or something more sophisticated.

For (online) Hessian estimation, I think linear regression of gradients seems promising:
In 1D, a parabola has gradient (derivative) g = lambda (x - p); we can get lambda from linear regression - optimizing the least-squares error over the (noisy!) gradients -
using four averages, of [g], [x], [xg] and [xx], we get lambda = ([gx] - [g][x]) / ([xx] - [x]^2).
For the non-parabola case, we can replace plain averages with exponential moving averages - weakening the weights of old noisy gradients.
In multiple dimensions we analogously update 4 exponential moving averages, but [g] (like in momentum) and [x] are vectors, and [xg], [xx] are matrices.
The linear-regression Hessian estimate is then built from such covariance matrices:
H = ([gx] - [g][x]^T) ([xx] - [x][x]^T)^-1
Have you seen something like this in the literature?
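A minimal sketch of this estimator (my own code, not from the paper), meant for a small d-dimensional subspace where x and g are the projected position and gradient; the decay beta and the ridge term are arbitrary choices, and the early-iteration bias of the moving averages is not handled here.

Code:
import numpy as np

class EmaHessianEstimator:
    def __init__(self, d, beta=0.9):
        self.beta = beta
        self.mx  = np.zeros(d)          # EMA of x
        self.mg  = np.zeros(d)          # EMA of g
        self.mgx = np.zeros((d, d))     # EMA of g x^T
        self.mxx = np.zeros((d, d))     # EMA of x x^T

    def update(self, x, g):
        b = self.beta
        self.mx  = b * self.mx  + (1 - b) * x
        self.mg  = b * self.mg  + (1 - b) * g
        self.mgx = b * self.mgx + (1 - b) * np.outer(g, x)
        self.mxx = b * self.mxx + (1 - b) * np.outer(x, x)

    def hessian(self, ridge=1e-8):
        # H = ([gx] - [g][x]^T) ([xx] - [x][x]^T)^-1, with a ridge for stability
        cov_gx = self.mgx - np.outer(self.mg, self.mx)
        cov_xx = self.mxx - np.outer(self.mx, self.mx)
        return cov_gx @ np.linalg.inv(cov_xx + ridge * np.eye(len(self.mx)))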

Being "better" means mainly faster training, which is extremely important especially for deep NNs ... but also not being attracted to saddles and not stagnating on suboptimal plateaus - e.g. ADAM is known to often lead to suboptimal solutions, and the saddle-free Newton paper shows that other methods converge to "minima" with lots of strongly negative eigenvalues ... Generalization is a separate problem, but related.

  14. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
I always wondered if it's possible to use statistics to improve optimizer convergence.
I mean, quite frequently we have to find optimal parameters of the same function, just with different input,
so it should be possible to observe the optimization process and improve it based on stats?

  15. #12
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
Shelwien, a 5-part answer:
    1) SGD requires statistics extraction as we assume that gradients are noisy.

    2) This "optimal parameters of the same function, just with different input" sounds like context-dependent parametric models - neighboring thread: https://encode.ru/threads/3124-LOCO-...ny-alternatves

3) Regarding "observe the optimization process and improve it based on stats": often the optimum found for one problem is used as the initial point for optimizing another problem. For example, in computer vision, to train for a slightly different type of images.

4) However, exploiting an old optimization path for a new problem seems very problematic - the minimum of the loss function is often not a single point but a huge set, due to:
- symmetry - permuting parameters often gives an analogous solution,
- over-parametrization used in deep learning (# parameters >> data size, with regularization to focus on somewhat smooth/reasonable models), which makes minima have mostly zero eigenvalues - so the minimum is practically a submanifold of nearly full dimension - it seems tough to extract useful statistical trends from old paths.

5) However, there is a heuristic that initial weights should be chosen randomly - according to the statistics of a given type of NN ...

  16. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
> often the optimum found for one problem is used as the initial point
> for optimizing another problem.

Yes, but that's too simple.
I'm talking about something like observing that the optimal values for some parameter
are always in a specific range... just with statistics.

    But tracking the optimal parameter values is not the only option -
    usually the optimization process itself would also have some constants/parameters.

> there is a heuristic that initial weights should be chosen randomly -
> according to the statistics of a given type of NN ...

I try it sometimes - these days I'm always running parameter optimization
for some entropy models, so there are many options for experimenting.

My current conclusion on random init is that it never works for me -
the optimizer always gets stuck at much worse results than what it finds
when starting from previous profiles, which evolved during model development.

Of course, my optimizer is very far from gradient descent,
it's more like a GA, probably.

Actually, I did try to convert the entropy model to a function:
https://encode.ru/threads/1522-Param...rate)-function
But it mostly failed, as it wasn't precise enough or fast enough.

  17. #14
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
You could probably maintain the minimum while continuously transforming a parameter - e.g. just interleave tiny parameter steps with gradient descent steps (or use some implicit function optimization) ...
The optimization itself often has many of its own (hyper-)parameters, starting with the step size ... but ideally they should be adapted along the optimization process ... maybe such optimization scheduling could be transferred between problems ...

Regarding random init, it is said to be good to recreate the original statistics - the probability distributions of individual weights, maybe joint distributions (I have a tool for the joint case: https://www.dropbox.com/s/7u6f2zpreph6j8o/rapid.pdf ) ...
... but a previous result of optimization is probably a better initial point, maybe somehow perturbed.

If by GA you mean gradient ascent (of likelihood), it is practically the same - just switch the sign.

Regarding such parametric models, I also started with just polynomials and it was disappointing ...
but then I added optimized powers, e.g. |C-B|^0.8, |A-0.5|^4, |C-2B+D|^0.1, in https://arxiv.org/pdf/1906.03238 ... and the performance improved a lot.
Generally, such a power function would be put into a table, which allows further optimizing such tabled functions based on a dataset ...

  18. #15
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    41
    Thanks
    9
    Thanked 16 Times in 11 Posts
I think ultimately you need to be able to answer why second order information would help you in a situation where you're operating away from the optimal solution most of the time. It may be over-simplifying things, but ADAM is also successful because of its learning rate control, which is somewhat analogous to line search - making the most out of a descent direction (not taking steps that are too small). You do want individual batch updates to move you in the direction of the gradient (of the current batch) and likely orthogonal to previous gradients (to minimize how much you disturb the solution w.r.t. previous batches). And that starts to look like CG.

  19. #16
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
ADAM is momentum + strengthening underrepresented coordinates + maintaining inertia + bias correction for initializing the exponential moving average ... nothing like line search.
It doesn't try to model the distance to the minimum in the considered direction - which is contained in the linear trend of the gradients and can be cheaply taken into consideration - to increase steps on a plateau and decrease them in a steep valley.
Adding an online parabola model in the considered direction goes toward line search ... without its additional cost, if the parabola is extracted from the linear trend of gradients.

  20. #17
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    41
    Thanks
    9
    Thanked 16 Times in 11 Posts
    Quote Originally Posted by Jarek View Post
ADAM is momentum + strengthening underrepresented coordinates + maintaining inertia + bias correction for initializing the exponential moving average ... nothing like line search.
It doesn't try to model the distance to the minimum in the considered direction - which is contained in the linear trend of the gradients and can be cheaply taken into consideration - to increase steps on a plateau and decrease them in a steep valley.
Adding an online parabola model in the considered direction goes toward line search ... without its additional cost, if the parabola is extracted from the linear trend of gradients.
Any learning rate control is "analogous" to line search - it serves a similar purpose; I did not say it is the same. https://arxiv.org/pdf/1902.09843.pdf shows some ways to improve ADAM, and it explicitly talks about correcting ADAM's overzealous learning rate. A line search is just one method of step size control, and it may be too expensive, or a wash, to be worth it (eating a few extra function/gradient evaluations to do the search may be just as expensive as doing a few more iterations). Using momentum is one way of saying that your past update may have been too small (though that is not the full story either).

Anyway, the literature on optimization is too vast, and I am not sure this is the forum to discuss it in. My professional experience with optimization has always led to (ab)using the unique structure of a problem to specialize a general-purpose algorithm almost beyond recognition; it may very well turn out that the same is true for training neural networks - different architectures / cost functions may turn out to "prefer" different optimization algorithms, and that's what we call "job security" in the industry.

  21. #18
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
ADAM has a few heuristics, but it does not try to evaluate the distance to the minimum, which requires e.g. a 2nd order model: a parabola in the considered direction.
We can get it practically for free from the linear trend of gradients in the considered direction (e.g. by updating 4 exponential moving averages) - just ask where this linear trend intersects 0.
I wasn't able to find such an approach in the literature - have you seen something like this?

  22. #19
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    41
    Thanks
    9
    Thanked 16 Times in 11 Posts
I am not sure I have seen this; the closest analog I can think of is the Barzilai-Borwein method, which people have already tried with SGD - it is a way to get a step size without a search. I personally have no experience with it. It gets you into non-monotonic methods (where individual iterations may make the cost or the norm of the gradient grow).

(Edit: Here is a paper on it from very legit researchers: https://arxiv.org/pdf/1605.04131.pdf ; again, I can only recommend it based on the authors - I have not looked into it carefully.)
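For reference, the plain (deterministic) BB1 step rule looks like this - only an illustration of the formula alpha = s^T s / s^T y, not the stochastic variant from the linked paper, and without the safeguards a non-convex problem would need.

Code:
import numpy as np

def bb_gradient_descent(grad, x0, alpha0=1e-3, steps=100):
    x = x0.copy()
    g = grad(x)
    alpha = alpha0
    for _ in range(steps):
        x_new = x - alpha * g
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 1e-12:                  # BB1 step: alpha = s^T s / s^T y
            alpha = (s @ s) / sy        # (keep the previous alpha otherwise)
        x, g = x_new, g_new
    return x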

  23. The Following User Says Thank You to Stefan Atev For This Useful Post:

    Jarek (11th June 2019)

  24. #20
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
Thanks, interesting - at first look it seems quite far off, but I will look closer.
To get the linear trend of gradients from least-squares linear regression we need to update 4 (exponential moving) averages (g, x, xg, xx) - can you think of some paper using such averages?
Momentum uses the average of g only, TONGA of gg^T (the uncentered covariance matrix), but I haven't come across another optimizer exploiting this linear trend in particular: the xg average ...?

  25. #21
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    41
    Thanks
    9
    Thanked 16 Times in 11 Posts
    Quote Originally Posted by Jarek View Post
Thanks, interesting - at first look it seems quite far off, but I will look closer.
I realize I am waving my hands in the air quite a lot here - BB is just the only method I know of that picks a step size for gradient descent that is not obviously a line search or a Newton-style method. I have not seen a method that looks at the trend of the gradients to decide on a step size, but my experience is in more traditional (non-stochastic) optimization, where quasi-Newton/CG methods work quite well even if the problem is not quasi-convex, so I have had the luxury of not needing such trickery... Unfortunately, just doing a search for "adaptive learning rate" is a very fraught exercise - there's way too much published, and a lot of it of unknown quality.

  26. #22
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
I have been trying to find something similar in the literature, discuss it, ask people if they have seen something similar, especially the [xg] average ... but nothing.
So it seems to be a novel approach; the question is whether it is promising.
It gives a cheap online estimate of the Hessian from the gradient sequence alone, which is optimal in the MSE sense with exponentially weakening weights on the old gradients - this seems exactly what is needed, and I don't know any approach even close to its possibilities (?).
The closest is quasi-Newton, but an exponential moving average seems a much better way to extract statistics than estimating the Hessian from a few recent noisy gradients.
I know it needs some serious testing, but this is only a general template with lots of details to choose, and I am a lone theoretician - learning Python and testing it is one of my plans for the summer break.
Anyway, I think it is at least worth some deeper investigation (?); I would gladly discuss it, find a collaboration ...

  27. #23
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
I wanted to test the approach - learned the basics of Python, then moved to TensorFlow ... and got a bit frustrated; I have to focus on something else now and will return in a few weeks.
But I have prepared a more accessible paper this cycle - focused on the simplest d=1 case: an SGD momentum method with an online parabola model in the momentum direction to choose the step size - finding the linear trend of gradients in this direction using linear regression, which can be done by updating four (exponential moving) averages: of (theta, g, theta*g, theta^2), where theta and g are the 1D projections of position and gradient: https://arxiv.org/pdf/1907.07063
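A rough sketch of that d=1 idea (my own reading of it, with arbitrary constants beta, gamma, lr; it also ignores the corrections needed when the momentum direction drifts, which the paper addresses):

Code:
import numpy as np

def sgd_with_parabola(grad_fn, x0, lr=0.01, beta=0.9, gamma=0.5, steps=1000):
    x = x0.copy()
    m = np.zeros_like(x)                    # momentum = modelled direction
    a_t = a_g = a_tg = a_tt = 0.0           # EMAs of theta, g, theta*g, theta^2
    for _ in range(steps):
        grad = grad_fn(x)
        m = beta * m + (1 - beta) * grad
        d = m / (np.linalg.norm(m) + 1e-12) # unit momentum direction
        th, g1 = x @ d, grad @ d            # 1D projections of position and gradient
        a_t  = beta * a_t  + (1 - beta) * th
        a_g  = beta * a_g  + (1 - beta) * g1
        a_tg = beta * a_tg + (1 - beta) * th * g1
        a_tt = beta * a_tt + (1 - beta) * th * th
        lam = (a_tg - a_t * a_g) / (a_tt - a_t * a_t + 1e-12)   # curvature estimate
        if lam > 1e-12:
            p = a_t - a_g / lam             # where the gradient trend crosses zero
            x = x + gamma * (p - th) * d    # move part of the way toward it
        else:
            x = x - lr * grad               # non-positive curvature: plain SGD step
    return x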


Attached image: ogr.png (71.5 KB)

