Abstract:
The second-order Taylor polynomial provides a local, “bowl-shaped” approximation of a given function. This approximation serves as the foundation for curvature-based methods, in particular (i) optimization techniques based on the Newton step, the step to the minimizer of the local approximation, and (ii) the Laplace approximation, which approximates arbitrary probability distributions by simpler Gaussian distributions.
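In standard notation (generic symbols, not tied to this thesis), the second-order Taylor polynomial of a twice-differentiable function $f$ around a point $x_0$ is

\[ q(x) = f(x_0) + \nabla f(x_0)^\top (x - x_0) + \tfrac{1}{2} (x - x_0)^\top \nabla^2 f(x_0)\, (x - x_0), \]

and, for positive definite $\nabla^2 f(x_0)$, the Newton step $-[\nabla^2 f(x_0)]^{-1} \nabla f(x_0)$ moves exactly to the minimizer of $q$. The Laplace approximation applies the same expansion to a log-density at its mode $\theta_\ast$ and yields the Gaussian $\mathcal{N}(\theta_\ast, [-\nabla^2 \log p(\theta_\ast)]^{-1})$.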
Such curvature-based approaches offer opportunities to improve existing machine learning methods. Training modern models on large data sets is often computationally expensive, and Newton-based approaches could make a valuable contribution to improving training efficiency. Curvature-based approaches are also useful for approximate inference: The Laplace approximation, for instance, is used for inference in non-conjugate Gaussian processes and for quantifying uncertainty in neural network predictions.
Despite their potential, curvature-based methods are still used only to a limited extent in machine learning. This is primarily because these algorithms require second-order derivatives, whose evaluation incurs high computational costs. In addition, the Taylor polynomial can in some cases not be computed exactly and is only accessible in the form of a stochastic estimate. The overarching goal of this work is to address these limitations in order to further unlock the benefits of curvature-based methods in machine learning. Specifically, the thesis presents three contributions.
We first introduce IterNCGP, an iterative inference method for non-conjugate Gaussian processes based on the Laplace approximation. Our method uses “incomplete” computations, enabling inference on large data sets, but captures the resulting approximation error as an additional source of uncertainty. This allows our approach to trade reduced computational cost for increased uncertainty. Furthermore, by reusing computationally expensive intermediate results, the inference process is accelerated significantly.
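As a generic illustration of such “incomplete” computations (a sketch of a standard ingredient, truncated conjugate gradients on a kernel system; this is not the IterNCGP algorithm, and all names are hypothetical):

import numpy as np

def truncated_cg(A, b, max_iter):
    """Conjugate gradients for A x = b, stopped after max_iter steps."""
    x = np.zeros_like(b)
    r = b - A @ x                 # residual of the current iterate
    p = r.copy()
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    # The residual norm quantifies the error left by stopping early; an
    # approach like the one above can fold such error into its uncertainty.
    return x, np.linalg.norm(r)

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(200, 1))
K = np.exp(-0.5 * (X - X.T) ** 2 / 0.1**2)   # RBF kernel matrix
x, res = truncated_cg(K + 1e-2 * np.eye(200), rng.standard_normal(200), 15)
print(f"residual norm after 15 CG steps: {res:.2e}")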
Second, we present the open-source library ViViT, which provides efficient access to curvature information in the context of neural networks. Our approach exploits the low-rank structure of the generalized Gauss-Newton matrix. This enables efficient computation of eigenvalues and eigenvectors, as well as slopes and curvatures along those directions. Unlike existing methods, our approach explicitly exposes the noise in the slope and curvature estimates and implements principled approximations that allow for a trade-off between computational cost and accuracy.
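To illustrate the linear-algebra identity that makes this efficient, here is a minimal NumPy sketch (illustrative only; this is not ViViT's API, and the factor V merely stands in for the Jacobian-based factor of the generalized Gauss-Newton matrix):

import numpy as np

# The generalized Gauss-Newton matrix admits a low-rank factorization
# G = V @ V.T, where V has far fewer columns (one per sample and output)
# than the number of parameters P.
P, NC = 1000, 8                   # parameter count vs. rank of the factor
rng = np.random.default_rng(0)
V = rng.standard_normal((P, NC))  # stand-in for the Jacobian-based factor

# The nonzero eigenvalues of G = V V^T equal those of the small Gram
# matrix V^T V (NC x NC), so the P x P matrix G is never formed.
evals, E = np.linalg.eigh(V.T @ V)          # cost O(NC^3) instead of O(P^3)

# Eigenvectors of G follow by mapping Gram eigenvectors through V.
U = (V @ E) / np.sqrt(np.maximum(evals, 1e-12))

# Check against the explicit (expensive) eigendecomposition of G.
G = V @ V.T
assert np.allclose(np.sort(evals), np.linalg.eigvalsh(G)[-NC:])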
Computing the exact Taylor polynomial is prohibitively expensive for neural networks. It is therefore typically approximated on a small subset of the training data. In our third contribution, we show that this leads to systematic biases in the Taylor approximation. We provide a theoretical explanation for this phenomenon and describe its negative effects on curvature-based methods. Finally, we develop effective strategies for correcting these biases. They incur negligible computational overhead yet significantly improve the quality of the approximation.
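As a toy illustration of this kind of selection bias (synthetic per-sample curvature matrices, not an actual network; all names hypothetical): directions selected on a mini-batch look systematically sharper on that batch than on the full data.

import numpy as np

rng = np.random.default_rng(1)
D, N, B = 20, 10_000, 32          # dimension, data set size, batch size

# Per-sample Hessians: shared ground truth plus symmetric per-sample noise.
A = np.diag(np.linspace(0.1, 1.0, D))
Z = rng.standard_normal((N, D, D))
Hs = A + 0.25 * (Z + Z.transpose(0, 2, 1))
H_full = Hs.mean(axis=0)          # full-data curvature, approximately A

gaps = []
for _ in range(200):
    batch = rng.choice(N, size=B, replace=False)
    H_batch = Hs[batch].mean(axis=0)
    u = np.linalg.eigh(H_batch)[1][:, -1]  # top eigendirection of the batch
    # Curvature along u: on the batch used for selection vs. on all data.
    gaps.append(u @ H_batch @ u - u @ H_full @ u)

print(f"mean overestimation: {np.mean(gaps):.3f}")  # systematically > 0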
This thesis advances curvature-based methods for machine learning: Methodological innovations improve their efficiency, software implementations make them more accessible, and acknowledging and correcting inherent approximation errors increases their reliability and effectiveness. Our contributions support the broader application of curvature-based methods in machine learning research and practice.