Best Research Papers From ACL 2020

ACL is the leading conference in the field of natural language processing (NLP), covering a broad spectrum of research areas in computational linguistics. Because of COVID-19 risks, ACL 2020 took place entirely online, like other major academic conferences this year.
However, as always, it was the best place to learn about the latest NLP research trends and cutting-edge research papers in language modeling, conversational AI, machine translation, and other NLP research topics.
Following the long-standing tradition, the best paper awards were announced on the last day of the main conference. In this article, we’ve summarized the key research ideas of the papers that received the Best Paper Award and Honorable Mentions at ACL 2020.
If you’d like to skip around, here are the papers we featured:

Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

ACL 2020 Best Paper Awards
1. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList, by Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh
Original Abstract
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
Our Summary
The authors point out the shortcomings of existing approaches to evaluating the performance of NLP models. A single aggregate statistic, such as accuracy, makes it difficult to tell where the model is failing and how to fix it. Alternative evaluation approaches usually focus on individual tasks or specific capabilities. To address the lack of comprehensive evaluation approaches, the researchers introduce CheckList, a...
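To make the idea of behavioral testing more concrete, here is a minimal Python sketch of two test types described in the CheckList paper: a minimum functionality test, which checks simple inputs against expected labels, and an invariance test, which checks that a label-preserving edit leaves the prediction unchanged. It does not use the authors' software tool. The keyword-based toy_sentiment_model, the helpers run_mft and run_invariance, the template, and the example sentences are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping text to a label ("pos" or "neg").
Model = Callable[[str], str]


def toy_sentiment_model(text: str) -> str:
    """Hypothetical stand-in model: compares counts of sentiment keywords."""
    positive = {"good", "great", "love"}
    negative = {"bad", "terrible", "hate"}
    words = text.lower().replace(".", "").split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return "pos" if score >= 0 else "neg"


def run_mft(model: Model, cases: List[Tuple[str, str]]) -> List[str]:
    """Minimum functionality test: return inputs the model labels incorrectly."""
    return [text for text, expected in cases if model(text) != expected]


def run_invariance(model: Model, pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Invariance test: return pairs whose predictions differ after the edit."""
    return [(a, b) for a, b in pairs if model(a) != model(b)]


# Generate negation test cases from a template (assumed example, not from the paper).
template = "I do not love this {noun}."
negation_cases = [(template.format(noun=n), "neg") for n in ("flight", "airline")]

# Swapping a city name should not change the predicted sentiment.
location_pairs = [
    ("The flight to Chicago was great.", "The flight to Dallas was great."),
    ("The flight to Chicago was terrible.", "The flight to Dallas was terrible."),
]

mft_failures = run_mft(toy_sentiment_model, negation_cases)
inv_failures = run_invariance(toy_sentiment_model, location_pairs)

print(f"Negation MFT failures: {len(mft_failures)} of {len(negation_cases)}")
print(f"Location invariance failures: {len(inv_failures)} of {len(location_pairs)}")
```

In this sketch the toy model ignores the word "not", so the negation test surfaces a specific failure that a single accuracy number would hide, while the location invariance test passes. Grouping results by capability and test type is what lets a tester see where a model fails.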