Publication | Open Access
Longformer: The Long-Document Transformer
2.2K Citations | 32 References | Year 2020
LLM Fine-tuning, Engineering, Machine Learning, Long-document Transformer, Multilingual Pretraining, Large Language Model, Speech Recognition, Natural Language Processing, Information Retrieval, Document Engineering, Computational Linguistics, Attention Mechanism, Language Studies, Language Models, Transformer-based Models, Data Management, Machine Translation, Sequence Modelling, Computer Science, Deep Learning, Long Sequences, Digitization, Retrieval Augmented Generation, Linguistics, Document Processing
Transformer-based models cannot process long sequences because their self-attention scales quadratically with sequence length. The authors introduce Longformer, whose attention combines a local sliding window with task-motivated global attention and scales linearly, making it practical for documents of thousands of tokens. Longformer's attention is a drop-in replacement for standard self-attention; it achieves state-of-the-art results on character-level language modeling (text8 and enwik8), and after pretraining and fine-tuning it consistently outperforms RoBERTa on long-document tasks, setting new state-of-the-art results on WikiHop and TriviaQA. A Longformer-Encoder-Decoder (LED) variant for generative sequence-to-sequence tasks achieves strong results on arXiv summarization.
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.
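To make the local-windowed plus global attention pattern from the abstract concrete, here is a minimal sketch of the resulting sparsity pattern. It is illustrative only, not the authors' implementation: the function name, the window size of 512, and the choice of position 0 as the single global token are assumptions. Note that a real linear-scaling implementation computes only the non-zero band; this sketch materializes the full mask purely to visualize which token pairs attend to each other.

```python
import torch

def longformer_style_mask(seq_len: int, window: int, global_idx: list[int]) -> torch.Tensor:
    """Sketch of a combined local-window + global attention pattern.

    Returns a boolean (seq_len, seq_len) matrix where True marks an allowed
    query-key pair. Window size and global positions are illustrative.
    """
    i = torch.arange(seq_len)
    # Local attention: each token attends to neighbors within +/- window//2.
    mask = (i[:, None] - i[None, :]).abs() <= window // 2
    # Global attention: selected positions (e.g. a [CLS] or question token)
    # attend to all tokens and are attended to by all tokens.
    for g in global_idx:
        mask[g, :] = True
        mask[:, g] = True
    return mask

mask = longformer_style_mask(seq_len=4096, window=512, global_idx=[0])
# The number of attended pairs grows roughly as seq_len * window rather than
# seq_len ** 2, which is the source of the linear scaling described above.
print(mask.sum().item(), "attended pairs vs", 4096 * 4096, "for full attention")
```

In practice the band and the few global rows/columns are computed directly, so neither memory nor compute ever touches the full n x n matrix shown here.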
| Year | Citations |
|---|---|
| 2019 | 17.1K |
| 2014 | 13.3K |
| 2016 | 3.6K |
| 2011 | 3.3K |
| 2019 | 3.1K |
| 2015 | 2K |
| 2019 | 1.6K |
| 2018 | 1.4K |
| 2020 | 1.2K |
| 2018 | 909 |