
Publication | Open Access

Surprises in high-dimensional ridgeless least squares interpolation

Citations: 480

References: 80

Year: 2022

Abstract

Interpolators (estimators that achieve zero training error) have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters $p$ is of the same order as the number of samples $n$. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in \mathbb{R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = \Sigma^{1/2} z_i$ (with $z_i \in \mathbb{R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = \varphi(W z_i)$ (with $z_i \in \mathbb{R}^d$, $W \in \mathbb{R}^{p \times d}$ a matrix of i.i.d. entries, and $\varphi$ an activation function acting componentwise on $W z_i$). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.
