Publication | Closed Access
Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping
Citations: 869
References: 12
Year: 2000
Venue: Unknown
Artificial Intelligence, Worse Generalization, Excess Capacity, Pre-training, Engineering, Machine Learning, Data Science, Conjugate Gradient, Computational Neuroscience, Machine Learning Model, Computational Learning Theory, Early Stopping, Multi-task Learning, Computer Science, Robot Learning, Neural Architecture Search, Recurrent Neural Network, Neural Scaling Law
Backpropagation networks with excess hidden units are traditionally thought to generalize poorly. This study shows that such networks can generalize well when trained with backpropagation and early stopping. The authors give two reasons. First, excess capacity allows a better fit to highly nonlinear regions, while backpropagation often avoids overfitting regions of low nonlinearity. Second, large nets pass through learning stages similar to those of smaller nets, so early stopping can halt a large net at the point where it generalizes comparably to a smaller one. Conjugate-gradient training, by contrast, can worsen generalization because it overfits low-nonlinearity regions while learning to fit high-nonlinearity ones.
The conventional wisdom is that backprop nets with excess hidden units generalize poorly. We show that nets with excess capacity generalize well when trained with backprop and early stopping. Experiments suggest two reasons for this: 1) Overfitting can vary significantly in different regions of the model. Excess capacity allows better fit to regions of high non-linearity, and backprop often avoids overfitting the regions of low non-linearity. 2) Regardless of size, nets learn task subcomponents in similar sequence. Big nets pass through stages similar to those learned by smaller nets. Early stopping can stop training the large net when it generalizes comparably to a smaller net. We also show that conjugate gradient can yield worse generalization because it overfits regions of low non-linearity when learning to fit regions of high non-linearity.
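The training recipe the abstract describes (plain backprop on an over-sized net, halted by early stopping) can be sketched as follows. This is a minimal illustration, not the authors' experimental setup: the noisy sine data, the 20-unit tanh hidden layer, the learning rate, the `patience` rule, and all function names are assumptions chosen for the example. It keeps the weights from the epoch with the lowest validation error and stops once that error has not improved for `patience` epochs.

```python
import numpy as np

def init_params(n_in, n_hidden, rng):
    # One tanh hidden layer and a linear output; sizes are illustrative.
    return {
        "W1": rng.normal(0.0, 0.5, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }

def forward(params, X):
    H = np.tanh(X @ params["W1"] + params["b1"])
    return H, H @ params["W2"] + params["b2"]

def mse(params, X, y):
    _, pred = forward(params, X)
    return float(np.mean((pred - y) ** 2))

def backprop_step(params, X, y, lr):
    # One full-batch gradient-descent step on mean squared error.
    H, pred = forward(params, X)
    d_out = 2.0 * (pred - y) / X.shape[0]
    d_hid = (d_out @ params["W2"].T) * (1.0 - H ** 2)
    grads = {
        "W1": X.T @ d_hid, "b1": d_hid.sum(axis=0),
        "W2": H.T @ d_out, "b2": d_out.sum(axis=0),
    }
    # Return a new dict so earlier snapshots are never mutated.
    return {k: params[k] - lr * grads[k] for k in params}

def train_with_early_stopping(X_tr, y_tr, X_val, y_val, n_hidden=20,
                              lr=0.01, max_epochs=2000, patience=100, seed=0):
    rng = np.random.default_rng(seed)
    params = init_params(X_tr.shape[1], n_hidden, rng)
    best_params, best_val, best_epoch = params, mse(params, X_val, y_val), 0
    history = [best_val]
    for epoch in range(1, max_epochs + 1):
        params = backprop_step(params, X_tr, y_tr, lr)
        val = mse(params, X_val, y_val)
        history.append(val)
        if val < best_val:
            # New best validation error: remember these weights.
            best_params, best_val, best_epoch = params, val, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break
    return best_params, best_val, best_epoch, history

# Toy regression task (an assumption for this sketch): a noisy sine curve.
data_rng = np.random.default_rng(1)
X = data_rng.uniform(-3.0, 3.0, size=(80, 1))
y = np.sin(X) + data_rng.normal(0.0, 0.2, size=X.shape)
X_tr, y_tr, X_val, y_val = X[:50], y[:50], X[50:], y[50:]

params, best_val, best_epoch, history = train_with_early_stopping(
    X_tr, y_tr, X_val, y_val)
print(f"stopped after {len(history) - 1} epochs; "
      f"best validation MSE {best_val:.4f} at epoch {best_epoch}")
```

Returning the best snapshot rather than the final weights is what lets a large net be used at the point where its validation error is lowest, which is the role early stopping plays in the abstract's argument.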