Publication | Open Access
BP-Transformer: Modelling Long-Range Context via Binary Partitioning
Citations: 56
References: 20
Year: 2019
Structured Prediction, LLM Fine-tuning, Engineering, Machine Learning, Advanced Computing, Computer Architecture, Computer-aided Design, Large Language Model, Localization, Context Management, Corpus Linguistics, Text Mining, Natural Language Processing, CUDA Kernels, Data Science, Computational Linguistics, Transformer Model, Modeling And Simulation, Parallel Computing, Machine Translation, Binary Partitioning, Computer Science, Deep Learning, Neural Machine Translation, Context Model, Parallel Programming
The Transformer model is widely successful on many natural language processing tasks. However, the quadratic complexity of self-attention limits its application to long text. In this paper, we propose BP-Transformer (BPT for short), which adopts a fine-to-coarse attention mechanism on multi-scale spans obtained via binary partitioning (BP). BPT yields $O(k\cdot n\log (n/k))$ connections, where $n$ is the sequence length and $k$ is a hyperparameter that controls the density of attention. BPT strikes a good balance between computational complexity and model capacity. A series of experiments on text classification, machine translation, and language modeling shows that BPT outperforms previous self-attention models on long text. Our code, hyperparameters, and CUDA kernels for sparse attention are available in PyTorch.
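To give a feel for the stated bound, the following minimal sketch compares the $n^2$ connections of dense self-attention with the growth term $k\cdot n\log (n/k)$. It is only arithmetic on the formula from the abstract, not the authors' implementation: the function names are hypothetical, constant factors hidden by the $O(\cdot)$ notation are dropped, and the base-2 logarithm is an assumption suggested by the binary partitioning.

```python
import math


def dense_connections(n: int) -> int:
    """Token-to-token connections in full self-attention: n * n."""
    return n * n


def bpt_connection_bound(n: int, k: int) -> float:
    """Growth term k * n * log2(n / k) taken from the abstract's O(.) bound.

    Assumptions: constant factors are ignored and the logarithm is base 2.
    This is a rough scale estimate, not an exact edge count.
    """
    if not 0 < k <= n:
        raise ValueError("expected 0 < k <= n")
    return k * n * math.log2(n / k)


if __name__ == "__main__":
    k = 4  # illustrative density hyperparameter, chosen arbitrarily
    for n in (512, 2048, 8192):
        dense = dense_connections(n)
        bound = bpt_connection_bound(n, k)
        print(f"n={n:5d} k={k} dense={dense:>10d} "
              f"bound~{bound:>10.0f} ratio={dense / bound:.1f}")
```

Under these assumptions, the gap between the two counts widens as $n$ grows, which is the motivation the abstract gives for using the method on long text.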