PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination
Year: 2020
We develop a novel method, called PoWER-BERT, for improving the inference time of the popular BERT model while maintaining its accuracy. It works by: (a) exploiting redundancy among word-vectors (intermediate encoder outputs) and eliminating the redundant vectors; (b) determining which word-vectors to eliminate by developing a strategy for measuring their significance, based on the self-attention mechanism; and (c) learning how many word-vectors to eliminate by augmenting the BERT model and the loss function. Experiments on the standard GLUE benchmark show that PoWER-BERT achieves up to 4.5x reduction in inference time over BERT with <1% loss in accuracy. We show that PoWER-BERT offers a significantly better trade-off between accuracy and inference time than prior methods. We further demonstrate that our method attains up to 6.8x reduction in inference time with <1% loss in accuracy when applied to ALBERT, a highly compressed version of BERT. The code for PoWER-BERT is publicly available at https://github.com/IBM/PoWER-BERT.
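To make steps (b) and (c) concrete, below is a minimal inference-time sketch, not the authors' implementation: it scores each word-vector by the total self-attention it receives across heads and query positions, then keeps only the top-scoring positions. Function names, tensor shapes, and the hard top-k selection are assumptions for illustration.

```python
import torch

def significance_scores(attn_probs: torch.Tensor) -> torch.Tensor:
    """Score each position by the total attention it receives.

    attn_probs: (batch, heads, seq_len, seq_len), the softmax-normalized
    attention of one encoder layer; entry [b, h, i, j] is the attention
    that query position i pays to position j.
    Returns: (batch, seq_len) significance score per position.
    """
    # Sum over heads, then over query positions: total attention
    # received by each position j.
    return attn_probs.sum(dim=1).sum(dim=1)

def eliminate_word_vectors(hidden: torch.Tensor,
                           attn_probs: torch.Tensor,
                           keep: int) -> torch.Tensor:
    """Retain only the `keep` most significant word-vectors.

    hidden: (batch, seq_len, dim), word-vectors output by this layer.
    Returns: (batch, keep, dim), the surviving word-vectors.
    """
    scores = significance_scores(attn_probs)    # (batch, seq_len)
    topk = scores.topk(keep, dim=1).indices     # (batch, keep)
    topk, _ = topk.sort(dim=1)                  # preserve original word order
    idx = topk.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
    return hidden.gather(1, idx)                # (batch, keep, dim)
```

In the paper, the number of word-vectors retained at each encoder layer is not fixed by hand but learned during fine-tuning via the augmented model and loss; the hard top-k selection above is what such a learned retention configuration would apply at inference time.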