Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

TLDR

Fine-tuning updates all of a pretrained model's parameters, whereas linear probing updates only the final linear layer (the "head"). Fine-tuning is known to give higher in-distribution (ID) accuracy, but it can give lower out-of-distribution (OOD) accuracy than linear probing when the pretrained features are good and the distribution shift is large. The paper shows theoretically that this ID/OOD trade-off arises even for overparameterized two-layer linear networks: when fine-tuning starts from a fixed or random head, the lower layers change while the head is being learned, which distorts the pretrained features and leads to high OOD error. Averaged over ten distribution-shift datasets, fine-tuning achieves about 2% higher ID accuracy but about 7% lower OOD accuracy than linear probing. A two-step strategy, linear probing followed by full fine-tuning (LP-FT), outperforms both methods on these datasets and is 1% better ID and 10% better OOD than full fine-tuning.

Abstract

When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR $\to$ STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).
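To make the three training strategies concrete, the following is a minimal sketch on a toy two-layer linear network f(x) = v · (B x), where B plays the role of the pretrained feature extractor and v is the head. It is not the paper's experimental setup: the two-dimensional data, the "pretrained" matrix `B0`, the zero head initialization, the learning rate, and all function names are assumptions chosen for illustration. The ID data varies only along the first input coordinate, so gradient descent never touches the feature weight on the second coordinate, which is the one the OOD point relies on.

```python
def features(B, x):
    # Feature map B x, with B stored as a list of rows.
    return [sum(b * xj for b, xj in zip(row, x)) for row in B]

def predict(v, B, x):
    # Two-layer linear network: head v applied to features B x.
    return sum(vi * fi for vi, fi in zip(v, features(B, x)))

def mse(v, B, data):
    return sum((predict(v, B, x) - y) ** 2 for x, y in data) / len(data)

def train(v, B, data, lr, steps, update_features):
    # Full-batch gradient descent on the mean of 0.5 * (prediction - y)^2.
    # update_features=False is linear probing (only the head v moves);
    # update_features=True is full fine-tuning (v and B move together).
    v = list(v)
    B = [list(row) for row in B]
    n = len(data)
    for _ in range(steps):
        grad_v = [0.0] * len(v)
        grad_B = [[0.0] * len(row) for row in B]
        for x, y in data:
            feats = features(B, x)
            r = predict(v, B, x) - y  # residual on this example
            for i in range(len(v)):
                grad_v[i] += r * feats[i] / n
                for j in range(len(x)):
                    grad_B[i][j] += r * v[i] * x[j] / n
        v = [vi - lr * g for vi, g in zip(v, grad_v)]
        if update_features:
            B = [[b - lr * g for b, g in zip(row, grow)]
                 for row, grow in zip(B, grad_B)]
    return v, B

# Assumed toy problem: the target is y = x[0] + x[1], and the "pretrained"
# feature B0 already captures it, so the ideal head is v = [1.0].
B0 = [[1.0, 1.0]]
v0 = [0.0]  # fixed head initialization
id_data = [([1.0, 0.0], 1.0), ([2.0, 0.0], 2.0)]  # ID: first coordinate only
ood_data = [([0.0, 1.0], 1.0)]                    # OOD: second coordinate

v_lp, B_lp = train(v0, B0, id_data, 0.1, 200, update_features=False)
v_ft, B_ft = train(v0, B0, id_data, 0.1, 200, update_features=True)
# LP-FT: fine-tune starting from the linear-probed head.
v_lpft, B_lpft = train(v_lp, B_lp, id_data, 0.1, 200, update_features=True)

for name, (v, B) in [("LP", (v_lp, B_lp)), ("FT", (v_ft, B_ft)),
                     ("LP-FT", (v_lpft, B_lpft))]:
    print(name, "ID mse:", mse(v, B, id_data), "OOD mse:", mse(v, B, ood_data))
```

In this toy, fine-tuning from the zero head fits the ID data by changing both the head and the ID-direction feature weight, so the head no longer matches the untouched OOD-direction weight and the OOD prediction is off. Starting fine-tuning from the linear-probed head leaves almost nothing to correct, so the features barely move. The sketch only illustrates the intuition stated in the abstract and says nothing about the reported accuracy numbers.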