Reformer, Longformer, and ELECTRA: Key Updates To Transformer Architecture In 2020

The leading pre-trained language models demonstrate remarkable performance on a range of NLP tasks, making them a welcome tool for applications such as sentiment analysis, chatbots, and text summarization. However, good performance usually comes at the cost of enormous computational resources that are not available to most researchers and business practitioners.
To address this issue, different research groups are working on increasing the compute efficiency and parameter efficiency of pre-trained language models without sacrificing their accuracy. Among the novel approaches introduced this year, at least three are regarded by the AI community as very promising. To help you stay aware of the latest NLP research advancements, we have summarized the corresponding research papers in an easy-to-read bullet-point format.
If you’d like to skip around, here are the papers we featured:

Reformer: The Efficient Transformer
Longformer: The Long-Document Transformer
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

State-of-the-art Transformers in 2020

1. Reformer: The Efficient Transformer, by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya

Original Abstract 
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(L log L), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
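To make the two techniques named in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The sublayers `f` and `g` are toy stand-ins for the attention and feed-forward sublayers, and the helper names (`rev_forward`, `rev_inverse`, `make_rotations`, `lsh_bucket`) are hypothetical, chosen only for illustration. The first half shows why a reversible residual block does not need stored activations: its inputs can be recomputed from its outputs. The second half shows one common angular hashing scheme, where a vector is projected onto random directions and the bucket is the index of the largest entry among the projections and their negations. The sketch assumes this scheme for illustration and covers only the bucketing step, not the attention computed within buckets.

```python
import math
import random


def f(x):
    # Toy stand-in for the attention sublayer (assumption, not real attention).
    return [math.tanh(v) for v in x]


def g(x):
    # Toy stand-in for the feed-forward sublayer (assumption).
    return [0.5 * v * v for v in x]


def _add(a, b):
    return [p + q for p, q in zip(a, b)]


def _sub(a, b):
    return [p - q for p, q in zip(a, b)]


def rev_forward(x1, x2):
    """Reversible residual block: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = _add(x1, f(x2))
    y2 = _add(x2, g(y1))
    return y1, y2


def rev_inverse(y1, y2):
    """Recover the block inputs from its outputs, so they need not be stored."""
    x2 = _sub(y2, g(y1))
    x1 = _sub(y1, f(x2))
    return x1, x2


def make_rotations(dim, n_buckets, seed=0):
    """Draw n_buckets / 2 random Gaussian directions of size dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(n_buckets // 2)]


def lsh_bucket(vec, rotations):
    """Angular LSH: bucket = argmax over [projections; -projections]."""
    proj = [sum(v * r for v, r in zip(vec, row)) for row in rotations]
    scores = proj + [-p for p in proj]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    x1, x2 = [0.1, -0.2, 0.3], [0.5, 0.0, -1.0]
    y1, y2 = rev_forward(x1, x2)
    print("recovered inputs:", rev_inverse(y1, y2))

    rotations = make_rotations(dim=3, n_buckets=8)
    print("bucket:", lsh_bucket([1.0, 2.0, -0.5], rotations))
```

Because the bucket depends only on the direction of a vector, vectors pointing the same way always share a bucket, and attention can then be restricted to items within the same bucket rather than computed over all pairs of positions.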
Our Summary 
The leading Transformer models have become so big that they can be realistically trained only in large research laboratories. To address this problem, the Google Research team introduces several techniques that improve the efficiency of Transformers. In particular, they suggest (1) using reversible layers to allow storing the activations only once instead of for each layer, and (2) using locality-sensitive hashing to avoid costly softmax computation in the case of full dot-product attention. Experiments on several text tasks demonstrate that the introduced Reformer model matches the performance of the full Transformer but runs much faster and with much better...