Upstream Mitigation Is Not All You Need: Testing the Bias Transfer Hypothesis in Pre-Trained Language Models

TLDR

Large pre-trained models underpin many ML systems and often embed harmful stereotypes learned from the internet. We investigate the bias transfer hypothesis: that stereotypes learned during pre-training propagate into harmful task-specific behavior after fine-tuning. We test it by applying controlled interventions that reduce intrinsic bias before fine-tuning and evaluating the resulting classifiers on two classification tasks. Reducing intrinsic bias before fine-tuning has little effect on discriminatory behavior after fine-tuning; regression analysis suggests that downstream disparities are better explained by biases in the fine-tuning dataset. Pre-training still matters: simple alterations to co-occurrence rates in the fine-tuning dataset are ineffective when the model has been pre-trained. The results encourage practitioners to focus more on dataset quality and context-specific harms.

Abstract

A few large, homogeneous, pre-trained models undergird many machine learning systems, and these models often contain harmful stereotypes learned from the internet. We investigate the bias transfer hypothesis: the theory that social biases (such as stereotypes) internalized by large language models during pre-training transfer into harmful task-specific behavior after fine-tuning. For two classification tasks, we find that reducing intrinsic bias with controlled interventions before fine-tuning does little to mitigate the classifier's discriminatory behavior after fine-tuning. Regression analysis suggests that downstream disparities are better explained by biases in the fine-tuning dataset. Still, pre-training plays a role: simple alterations to co-occurrence rates in the fine-tuning dataset are ineffective when the model has been pre-trained. Our results encourage practitioners to focus more on dataset quality and context-specific harms.
