Transfer Learning: Entering a new era in NLP

By Malte Pietsch
October 2019


With recent advances in deep learning-based NLP, it has become easier than ever to apply cutting-edge models to your own texts. In this article we introduce the method of transfer learning, explain why it helps you harvest information from your own texts, and show how you can actually implement such a model with open-source packages.

Harvesting Text Data with NLP

With the ongoing digitalization of companies, the amount of data is growing rapidly. While some data is stored in carefully chosen formats in beautifully engineered relational databases, most of it is unstructured data like text, video and images. Harvesting information from these kinds of files only became possible within the last few years, thanks to great progress in deep learning. Natural Language Processing (NLP) is the discipline within machine learning that focuses on all kinds of text data. It is a particularly exciting field these days: we see innovations published on a weekly basis, a very active and friendly open-source community, and at the same time countless opportunities to plant NLP models in the real world, because written language is omnipresent.

Training Data: Why many models stay in the greenhouse

Despite incredible progress in research, many of these pretty models stay in the greenhouse and never make their way into the field of business applications. One major bottleneck is labelled training data. While there are many openly available datasets that are used heavily to breed models in research, most of them do not help much with the business problems out there. For NLP, the main difficulty is the heterogeneity of language due to:

  • National languages (e.g. German)
  • Domain specific language (e.g. technical terms in aerospace)
  • Company specific language (e.g. product names or abbreviations)

That’s why many published research models flower like pretty, colorful orchids, but rapidly wither once planted into the industrial soil of your own use case. Let’s take one exemplary NLP problem, Named-Entity Recognition (NER), where we try to identify mentions of higher-level concepts within a text. Probably the most prominent dataset for this problem, CoNLL-2003, focuses on the concepts “person”, “location” and “organization”. The dataset fostered a lot of research on NLP models recognizing these concepts, many pre-trained models became publicly available, and the big players offer these kinds of concepts in their commercial APIs. Today’s systems, e.g. FLAIR, reach impressive accuracy on them (F1 score: 0.93). However, in most business cases we have come across, recognizing persons, locations and organizations is not exactly what is needed. Instead, recognizing specific law clauses, serial numbers, error codes or product names is what often creates real business value. Given the scarcity of public training data for these cases, the only way forward so far was to create training data yourself, requiring enormous human labour and slowing down the whole machine learning project.
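
To make the task concrete, here is a tiny hand-labelled example of what an NER system produces, using the common BIO tagging scheme, plus a few lines of pure Python that assemble entity spans from the tags. The sentence and labels are invented for illustration, not the output of any real model:

```python
# One token per position, one BIO tag per token (hand-labelled toy example).
# B- marks the beginning of an entity, I- its continuation, O is "outside".
tokens = ["Angela", "Merkel", "visited", "Paris", "."]
tags   = ["B-PER",  "I-PER",  "O",       "B-LOC", "O"]

entities = []
current = None
for tok, tag in zip(tokens, tags):
    if tag.startswith("B-"):
        current = [tag[2:], [tok]]   # start a new entity span
        entities.append(current)
    elif tag.startswith("I-") and current:
        current[1].append(tok)       # extend the running entity
    else:
        current = None               # O tag closes any open span

spans = [(label, " ".join(words)) for label, words in entities]
print(spans)  # → [('PER', 'Angela Merkel'), ('LOC', 'Paris')]
```

In a business setting, the label set would contain classes like `LAW_CLAUSE` or `SERIAL_NUMBER` instead of `PER` and `LOC`, which is exactly where public datasets stop helping.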

Transfer Learning: Simplifying the path from the greenhouse to the field

Transfer Learning is the idea of learning on one kind of problem (pretraining) and transferring the acquired knowledge to a new target problem, where it serves as the base for further fine-tuning (the downstream task). This is particularly appealing for practitioners because pretrained models offer a smooth way out of the research greenhouse: they can be adapted to specific business problems with less training data and often yield even better performance. The idea is not new and has been applied successfully to many Computer Vision problems.

However, transfer learning for NLP is challenging because the model needs to acquire some basic but general understanding of language. How can we teach a computer what a word means? This question is central to NLP, and since their publication, word2vec and GloVe have been the go-to algorithms for generating word vectors. While they do a good job of grouping together words with similar semantics, they are limited because they cannot capture the different meanings of a single word. That is, these models see no difference between the bark you find on a tree and the bark of a dog.
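
A minimal sketch of why static embeddings hit this limit: the model is essentially a lookup table that maps each word string to exactly one vector, so both senses of “bark” come out identical. The numbers below are made up for illustration; real word2vec/GloVe vectors have hundreds of dimensions:

```python
# Toy static embedding table (invented 3-d vectors, not real word2vec output).
static_embeddings = {
    "bark": [0.12, -0.40, 0.33],
    "tree": [0.55, 0.10, -0.21],
    "dog":  [-0.30, 0.25, 0.47],
}

def embed(sentence):
    # One vector per known word, regardless of the surrounding context.
    return [static_embeddings[w] for w in sentence.split() if w in static_embeddings]

v_tree_bark = embed("the bark of the tree")[0]  # "bark" as tree covering
v_dog_bark = embed("the bark of the dog")[0]    # "bark" as dog sound
assert v_tree_bark == v_dog_bark  # identical: the two senses are indistinguishable
```

Contextual language models compute the vector from the whole sentence instead of looking it up, which is precisely what resolves this ambiguity.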

In recent years, a new species of language models has emerged to tackle this very problem, and transferring their learned knowledge to downstream tasks has brought significant performance improvements. For example, if we look at the performance curve of the popular SQuAD 2.0 challenge, where all leading research labs and tech giants compete on the task of question answering, we see an enormous jump in October 2018.

This uplift in question answering and many other tasks was caused by a new language model architecture published by Google called “BERT”, which took the more general concept of transfer learning in NLP to a new level.

Pretrained models: One plant, many fruits

Due to BERT's success, there is now a whole family of related models, including the very recent XLNet and XLM. They all share a similar architecture and utilize the same steps of transfer learning: deep neural networks built from Transformer blocks that make use of the attention mechanism. If you are interested in the model internals, have a look at the great blog article by Jay Alammar to get a first intuition about BERT, and then read the original paper. In this article, we want to highlight the practical handling rather than the specific model internals. In our daily work at deepset, where we build custom NLP models for all kinds of companies, we often experience a big gap between the latest NLP methods in research and their practical application in industry. We believe that reducing the barrier to “just trying a model out” and democratizing NLP is key to shrinking this gap.
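
For a first intuition of the attention mechanism inside those Transformer blocks, here is a deliberately tiny, pure-Python sketch of scaled dot-product attention for a single query. Real implementations operate on batched tensors with learned query/key/value projections; this shows only the core formula (softmax of scaled dot products, used to mix the value vectors):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    # Scaled dot-product attention for one query vector.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output: attention-weighted mix of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

q  = [1.0, 0.0]                     # the query attends most to the key it matches
ks = [[1.0, 0.0], [0.0, 1.0]]
vs = [[10.0, 0.0], [0.0, 10.0]]
out = attention(q, ks, vs)
# The result leans toward the first value, whose key is most similar to q.
print(out)
```

Stacking many such attention operations (with learned projections, multiple heads and feed-forward layers) is what lets BERT compute context-dependent representations for every token.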

So how can you make use of a pre-trained language model like BERT for your own NLP problem? The process consists of three main steps:

  • Decide on a pre-trained language model (e.g. BERT)
  • Optional: Adapt the language model to language from your target domain by feeding in plain text
  • Adapt the model to your target problem (downstream task) by training on a small labelled data set
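
The last step can be sketched in miniature: the snippet below stands in a frozen “pretrained” feature extractor and trains only a small head on a handful of labelled examples. Everything here (the features, the data, the sizes) is a toy stand-in for illustration, not a real language model:

```python
import math

def pretrained_features(x):
    # Frozen stand-in for a pretrained model: raw input -> fixed features.
    return [x, x * x]

# Tiny labelled dataset for the downstream task: label 1 if x > 0.5, else 0.
data = [(i / 10.0, 1.0 if i > 5 else 0.0) for i in range(11)]

# Trainable head: logistic regression on top of the frozen features.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(500):
    for x, y in data:
        f = pretrained_features(x)
        z = sum(wi * fi for wi, fi in zip(w, f)) + b
        p = 1.0 / (1.0 + math.exp(-z))
        g = p - y  # gradient of the logistic loss w.r.t. z
        w = [wi - lr * g * fi for wi, fi in zip(w, f)]  # only the head is updated
        b -= lr * g

def predict(x):
    f = pretrained_features(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

print(predict(0.9), predict(0.1))  # → 1 0
```

The point of transfer learning is that the expensive part (the feature extractor) is reused, so the labelled dataset for the new task can stay small.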

Thankfully, there are a couple of open-source packages out there that support you in this process. The original TensorFlow repository has BERT models for various languages available and offers basic scripts to replicate the results in the paper. Hugging Face’s pytorch-pretrained-bert is a port of this original repo with some nice additions, including more model architectures, advanced multi-GPU support and additional scripts for language model fine-tuning. Its purpose, however, is still rather the replication of research results and serving as a starting point for your own model derivatives. For our practical work at deepset, we needed a transfer learning framework that allows

  • easy & fast adaptation of pretrained models to your target problem
  • powerful logging & tracking of models
  • persisting model parameters & configs for reproducibility
  • simple visualization and deployment to showcase your prototype

That’s why we decided to build our own Framework for Adapting Representation Models (FARM) a few months ago. After finding it highly useful in our daily work, we decided to share it with the community and make it open-source. Building your own state-of-the-art NER model with FARM takes only a few major steps:

  • 1) Select a pretrained model

        language_model = Bert.load("bert-base-cased")

  • 2) Add a prediction head on top that fits your target problem

        prediction_head = TokenClassificationHead(layer_dims=[768, 9])  # 768 hidden dims -> 9 NER tag classes
        model = AdaptiveModel(language_model=language_model,
                              prediction_heads=[prediction_head],
                              embeds_dropout_prob=0.1, lm_output_types=["per_token"], device=device)

  • 3) Create a data processor that reads in your files and converts the data into the appropriate format for the model

        processor = CONLLProcessor(tokenizer=tokenizer,
                                   max_seq_len=128, data_dir="data")  # point data_dir at your CoNLL-format files

  • 4) Train the model

        trainer = Trainer(...)
        model = trainer.train(model)

You can find a comprehensive, ready-to-run tutorial with more details in this Jupyter notebook. This should give you all the tools to plant your own language models and harvest their delicious fruits ...

Happy FARMing!

Author: Malte Pietsch, co-founder & NLP engineer

Malte is an experienced data scientist and co-founder of deepset. He crafts cutting-edge machine learning models and delivers them to various industries, with a special focus on deep learning-based NLP.
