How BERT works – a simple view

People in different parts of the world use a large variety of languages.

An enormous amount of content, from literature to everyday documents, is generated in each of these languages every day. Native speakers are so adept at their own language that they can understand and interpret this content with the utmost ease.

In today’s digital world, however, we increasingly need computers to understand and interpret these languages as well, so that they can perform various activities while minimising human effort, time, and cost.

Making machines learn and understand human languages well enough to perform such activities is a tedious and challenging task.

In this article, we’ll go over:

  • Machine Learning concepts
  • The story of BERT
  • How BERT works

Teaching the machine

Scientists and researchers have been working hard on this challenge over the last decade. In this article, we will succinctly illustrate how machines can perform various activities by understanding and interpreting natural languages.

Natural Language Processing

Natural language processing (NLP) has been an active area of research for the last two or three decades.

The importance of NLP has been growing of late due to the massive digitalisation of operations in our day-to-day lives. As a result, there is now a variety of machine learning models that can perform various NLP-related activities.

It should be mentioned that the biggest challenge in making a machine learn a natural language is training it well with sufficient data. Creating a well-labelled dataset for training a model to learn a language is a tedious job and a challenge in itself. However, the concept of transfer learning comes in handy and can be a saviour in such cases.

Introduction to Transfer Learning

Transfer learning is a concept where a model trained for a given task can be fine-tuned or retrained to do something similar without starting the whole training process from scratch, thereby reducing time and effort.

In 2018, a team of researchers at Google AI Language published a model pre-trained on a massive corpus of unlabelled text drawn from English Wikipedia and the BooksCorpus. This model, called Bidirectional Encoder Representations from Transformers (BERT), has revolutionised the whole NLP community.

The pre-trained BERT model can be fine-tuned with much smaller datasets to perform several NLP tasks such as question answering, sentiment analysis, sequence classification, named entity recognition, and many more.
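As a rough illustration of this kind of fine-tuning setup, here is a minimal sketch, assuming the Hugging Face transformers library and PyTorch are installed and using the bert-base-uncased checkpoint with an illustrative two-class task:

```python
# A minimal sketch of preparing a pre-trained BERT checkpoint for fine-tuning
# on a two-class task (e.g. sentiment analysis); assumes Hugging Face
# `transformers` and PyTorch are installed.
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The pre-trained encoder weights are reused; only the small classification
# head on top is initialised randomly and learnt from the downstream dataset.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]) -- one score per label
```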

In order to use BERT to perform these tasks, it is necessary to understand how it works.

How BERT works

In general, language models can be either context-free or contextual, and contextual representations can further be unidirectional or bidirectional. If a model is context-free, a particular word gets the same representation in every sentence in which it occurs, regardless of its surroundings.

Contextual models, on the other hand, generate a representation of each word based on the other words in the sentence, and are therefore much closer to how language actually works. However, the contextual models that existed before BERT were unidirectional or only shallowly bidirectional: they read a sequence either from left to right, or simply combined separate left-to-right and right-to-left passes.

BERT is the first deeply bidirectional contextual model: it generates a representation of each word in a sentence by using both its left and right context at every layer of the network.
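To see what "contextual" means in practice, the sketch below (assuming the Hugging Face transformers library and PyTorch) extracts BERT's vector for the word "bank" in two different sentences and shows that the two vectors are not identical:

```python
# Sketch: the same word gets different BERT vectors in different contexts.
# Assumes Hugging Face `transformers` and PyTorch are installed.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(sentence, word):
    # Return the final-layer hidden state of the first occurrence of `word`.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (sequence_length, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = word_vector("he sat on the bank of the river", "bank")
v2 = word_vector("she deposited money at the bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # < 1.0: the two "bank" vectors differ
```

A context-free embedding such as word2vec would return exactly the same vector for "bank" in both sentences.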

Masked Language Modelling

Unlike previously existing models, BERT uses a technique called masked language modelling (MLM): some of the words in a sentence are masked, and the model has to predict those masked words rather than simply predicting the next word in the sequence.

To predict a masked word, the model has to look at both the left-hand and right-hand parts of the sentence; in other words, it uses the full context of the words in both directions. This is what makes BERT such a powerful model for a variety of NLP-related problems.
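A quick way to see MLM in action is the fill-mask pipeline of the Hugging Face transformers library; here is a sketch, assuming that library is installed:

```python
# Sketch of masked language modelling with a pre-trained BERT checkpoint;
# assumes the Hugging Face `transformers` library is installed.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the words on both sides of [MASK] to rank candidate fillers.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```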

The architecture

The architecture of BERT is built upon the Transformer, which is essentially an attention mechanism that learns contextual relationships between the words in a sentence.

Normally, a Transformer consists of an encoder and a decoder: the encoder reads the text input, while the decoder produces the prediction for a given task. However, since the goal of BERT is to produce a language representation model, it includes only the encoder.
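As a small check of this encoder-only design, the size of the encoder stack can be read off the model's configuration; a sketch, assuming the Hugging Face transformers library:

```python
# Sketch: inspect the size of BERT's encoder stack.
# Assumes the Hugging Face `transformers` library is installed.
from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)    # 12 stacked encoder layers in BERT-Base
print(config.num_attention_heads)  # 12 attention heads per layer
print(config.hidden_size)          # 768-dimensional hidden vectors
```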

When text is fed as input to BERT's encoder, it is first split into tokens. Each token is assigned a number (an ID) according to BERT's vocabulary. These IDs are then converted into vectors, which are processed by the neural network.
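The sketch below (assuming the Hugging Face transformers library) shows this tokenisation step: the text is split into tokens, and each token is mapped to its ID in BERT's vocabulary:

```python
# Sketch of BERT's tokenisation step; assumes Hugging Face `transformers`.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Machines can learn natural languages.")
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)  # e.g. ['machines', 'can', 'learn', 'natural', 'languages', '.']
print(ids)     # the matching entries in BERT's ~30,000-token vocabulary
```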

In BERT's architecture, the input needs some further processing before it is passed on to the neural network. This processing attaches additional metadata to the input in the form of three embeddings:

  • token embedding
  • segment embedding
  • positional embedding.

Embedding the tokens

For the token embeddings, a [CLS] token is added at the beginning of the input text and a [SEP] token is added at the end of each sentence. For the segment embeddings, the sentences are distinguished as sentence A and sentence B by a marker attached to every token.

Finally, the positional embeddings indicate the position of each token in the text sequence. In practice, this input decoration is handled automatically by the tokeniser interface of the transformers library, as sketched below.
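Here is a short sketch of that decoration for a two-sentence input (assuming the Hugging Face transformers library): the tokeniser inserts [CLS] and [SEP] and produces the segment markers, while the positional embeddings are added inside the model from each token's index.

```python
# Sketch of the input decoration for a sentence pair; assumes Hugging Face
# `transformers` is installed.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("How old are you?", "I am six years old.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'how', 'old', 'are', 'you', '?', '[SEP]',
#  'i', 'am', 'six', 'years', 'old', '.', '[SEP]']

print(encoded["token_type_ids"])  # 0 marks sentence A, 1 marks sentence B
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```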

Once the input has been decorated in this way, it is ready to be processed by the neural network to perform a given task.

It is important to mention here that BERT was not pre-trained with a traditional left-to-right or right-to-left language modelling objective. Rather, it was pre-trained using two unsupervised tasks: (a) masked language modelling (MLM) and (b) next sentence prediction (NSP).

As described above, in MLM a certain percentage of the tokens in the input text is masked at random, and the model is trained to predict those masked tokens. In the second task, NSP, the model is given a pair of sentences and trained on a binary prediction: is the second sentence the one that actually follows the first? This teaches the model to understand sentence relationships.

Understanding the relationship between two sentences is necessary for various NLP tasks such as question answering and natural language inference.
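The binary decision learnt during NSP pre-training can be queried directly through the BertForNextSentencePrediction head; a sketch, assuming the Hugging Face transformers library and PyTorch:

```python
# Sketch of next sentence prediction with a pre-trained BERT checkpoint;
# assumes Hugging Face `transformers` and PyTorch are installed.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

first = "The weather was terrible."
second = "So we stayed indoors all day."

inputs = tokenizer(first, second, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Index 0 scores "the second sentence really follows the first",
# index 1 scores "the second sentence is a random one".
print(torch.softmax(logits, dim=-1))
```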

The performance of BERT was tested on the General Language Understanding Evaluation (GLUE) benchmark, a collection of diverse natural language understanding tasks. BERT outperformed all previous systems on all tasks, beating the existing state of the art. The BERT-Large and BERT-Base models achieved average accuracy improvements of 7% and 4.5%, respectively, over other language models, and notably BERT-Large outperformed BERT-Base across the GLUE tasks. Such improvements in performance can be attributed to transfer learning with language models using unsupervised pre-training.

In this article, we have given a basic overview of why NLP problems matter and how BERT works to process natural languages. In future articles, we will also discuss other models such as GPT-J and GPT-3 in more detail. Our next article will specifically cover how to start using BERT for a problem you have at hand.