The Dark Secrets Of BERT


This blog post summarizes the EMNLP 2019 paper Revealing the Dark Secrets of BERT by researchers from the Text Machine Lab at UMass Lowell: Olga Kovaleva (LinkedIn), Alexey Romanov (LinkedIn), Anna Rogers (Twitter: @annargrs), and Anna Rumshisky (Twitter: @arumshisky).
Here are the topics covered:

A brief intro to BERT
What types of self-attention patterns are learned, and how many of each type?
What happens in fine-tuning?
How much difference does fine-tuning make?
Are there any linguistically interpretable self-attention heads?
But… what information actually gets used at inference time?
Discussion

 
2019 could be called the year of the Transformer in NLP: this architecture dominated the leaderboards and inspired many analysis studies. The most popular Transformer is, undoubtedly, BERT (Devlin, Chang, Lee, & Toutanova, 2019). In addition to its numerous applications, multiple studies probed this model for various kinds of linguistic knowledge, typically concluding that such knowledge is indeed present, at least to some extent (Goldberg, 2019; Hewitt & Manning, 2019; Ettinger, 2019).
This work focuses on the complementary question: what happens in the fine-tuned BERT? In particular, how much of the linguistically interpretable self-attention patterns that are presumed to be its strength are actually used to solve downstream tasks?
To answer this question, we experiment with BERT fine-tuned on the following GLUE (Wang et al., 2018) tasks and datasets (a fine-tuning sketch follows the list):

paraphrase detection (MRPC and QQP);
textual similarity (STS-B);
sentiment analysis (SST-2);
textual entailment (RTE);
natural language inference (QNLI, MNLI).
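
For readers who want a concrete picture of what this fine-tuning step looks like in practice, here is a minimal sketch using the Hugging Face transformers and datasets libraries on one of these tasks (MRPC). This is not the paper's code; the model name, hyperparameters, and library choice are illustrative assumptions.

```python
# A minimal fine-tuning sketch (assumption: Hugging Face transformers + datasets,
# not the paper's original code), using MRPC as an example GLUE task.
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def encode(batch):
    # MRPC is a sentence-pair task: both sentences are fed to BERT as one input.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

dataset = load_dataset("glue", "mrpc").map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-mrpc",          # illustrative settings
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```

The same recipe applies to the other GLUE tasks, with the task name and number of labels changed accordingly.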

A brief intro to BERT
BERT stands for Bidirectional Encoder Representations from Transformers. This model is basically a multi-layer bidirectional Transformer encoder (Devlin, Chang, Lee, & Toutanova, 2019), and there are multiple excellent guides about how it works generally, including the Illustrated Transformer. What we focus on is one specific component of the Transformer architecture known as self-attention. In a nutshell, it is a way to weigh the components of the input and output sequences so as to model relations between them, even long-distance dependencies.
As a brief example, let’s say we need to create a representation of the sentence “Tom is a black cat”. BERT may choose to pay more attention to “Tom” while encoding the word “cat”, and less attention to the words “is”, “a”, “black”. This could be represented as a vector of weights (one for each word in the sentence). Such vectors are computed when the model encodes each word in the sequence, yielding a square matrix of weights, referred to as a self-attention map.
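
To make this concrete, here is a minimal sketch (not from the paper) of pulling such attention weights out of a pre-trained BERT with the Hugging Face transformers library; the model name and the choice of layer and head are arbitrary illustrations.

```python
# A minimal sketch of inspecting BERT's self-attention weights for the example
# sentence. Library, model name, and layer/head choice are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("Tom is a black cat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len): a weight for every pair of tokens.
attentions = outputs.attentions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Weights assigned by head 0 of the last layer while encoding the token "cat".
cat_idx = tokens.index("cat")
weights = attentions[-1][0, 0, cat_idx]
for tok, w in zip(tokens, weights.tolist()):
    print(f"{tok:>8s}  {w:.3f}")
```

Stacking such weight vectors for every position gives the per-head self-attention maps that the rest of this post analyzes.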
