Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing

posted May 30, 2020, 7:30 AM by Chris G   [ updated May 30, 2020, 7:31 AM ]

Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing

One of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples. However, modern deep learning-based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples. To help close this gap in data, researchers have developed a variety of techniques for training general purpose language representation models using the enormous amount of unannotated text on the web (known as pre-training). The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch.

This week, we open sourced a new technique for NLP pre-training called Bidirectional Encoder Representations from Transformers, or BERT. With this release, anyone in the world can train their own state-of-the-art question answering system (or a variety of other models) in about 30 minutes on a single Cloud TPU, or in a few hours using a single GPU. The release includes source code built on top of TensorFlow and a number of pre-trained language representation models. In our associated paper, we demonstrate state-of-the-art results on 11 NLP tasks, including the very competitive Stanford Question Answering Dataset (SQuAD v1.1).

What Makes BERT Different?
BERT builds upon recent work in pre-training contextual representations — including Semi-supervised Sequence LearningGenerative Pre-TrainingELMo, and ULMFit. However, unlike these previous models, BERT is the first deeply bidirectionalunsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).