Zero-Shot Classification

Digitas Data Science
Jun 1, 2021

by Paramdeep Khangura


Language transformers like BERT have pushed the boundaries of what’s possible in NLP. They’ve been used to solve problems such as topic classification, text summarisation and question answering, and they’ve spawned variants including RoBERTa (larger), DistilBERT (smaller) and XLM (multilingual).

Fundamentally, the transformer models mentioned above are trained on huge text datasets like Wikipedia and the BooksCorpus so that they learn language structure. These pre-trained models are then fine-tuned using training data for particular tasks.

Zero-shot classification is the process of using these pre-trained models and architectures to perform classification over a set of labels that the model hasn’t been fine-tuned to look for. In certain situations this is incredibly useful, as it removes the need for task-specific training data.

Zero-Shot in Action

This link will take you to a demo app hosted by Hugging Face which lets you see how zero-shot classification can be used. You enter some text and the labels you’re looking for, and the model outputs a confidence score for each label, with no training data required.
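The same behaviour is available programmatically through the transformers library’s zero-shot pipeline. A minimal sketch, assuming transformers is installed (the example text and labels are my own; the model checkpoint is downloaded on first use):

```python
from transformers import pipeline

# Downloads the facebook/bart-large-mnli checkpoint on first use.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The new phone has an impressive camera and all-day battery life.",
    candidate_labels=["technology", "cooking", "sports"],
)
# result["labels"] is sorted by result["scores"], highest confidence first.
print(result["labels"][0], result["scores"][0])
```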

There are a few different ways to approach zero-shot classification, but the demo uses an approach explored in this paper: https://arxiv.org/abs/1909.00161.

Training

This model is based on an NLI (natural language inference) approach. It uses a model trained on the MNLI dataset, where the input is a pair of sentences: the first is a premise, the second a hypothesis. The model is trained to classify the hypothesis with respect to the premise using one of three labels: contradiction, entailment or neutral.
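As a concrete sketch of this framing, each candidate topic can be turned into a hypothesis sentence and paired with the text as the premise (the template wording here is an illustrative assumption, not necessarily the exact one used in the paper):

```python
def build_nli_pairs(text, labels, template="This text is about {}."):
    """Pair the text (premise) with one hypothesis sentence per candidate label."""
    return [(text, template.format(label)) for label in labels]

pairs = build_nli_pairs(
    "The stock market rallied after the earnings report.",
    ["finance", "sports", "politics"],
)
# Each (premise, hypothesis) pair is then scored by the NLI model
# as contradiction / neutral / entailment.
```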

This pre-trained model was used in the paper for classification. The text to be classified was passed as the premise and the topic label as the hypothesis. The calculated probability of the entailment label (after disregarding neutral) tells us how confident the model is that the topic label is relevant to the piece of text.
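A minimal sketch of that scoring step, assuming we already have the model’s raw (contradiction, entailment) logits for each label (the neutral logit has been discarded, and the logit values below are made up for illustration):

```python
import math

def entailment_score(contradiction_logit, entailment_logit):
    """Softmax over (contradiction, entailment) after discarding neutral.
    Returns P(entailment): the model's confidence that the label applies."""
    c = math.exp(contradiction_logit)
    e = math.exp(entailment_logit)
    return e / (c + e)

# Hypothetical logits for three candidate labels: (contradiction, entailment).
logits = {"finance": (-2.1, 3.4), "sports": (1.8, -0.9), "politics": (0.2, 0.1)}
scores = {label: entailment_score(c, e) for label, (c, e) in logits.items()}
# The highest-scoring label is taken as the prediction.
```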

Single correct class scenario

The paper evaluated the zero-shot model on three variations of a classification problem: topic, emotion and situation. Focusing on topic, this was done using a Yahoo Answers dataset and returned an F1 score of 37.9. Another common approach to this task is to take vector representations of the text and topic (via Word2Vec) and evaluate their cosine similarity; in the paper this baseline returned an F1 score of 35.7.

Using the same MNLI approach but with a BART-large model in place of BERT produces an F1 score of 53.7, considerably better than the Word2Vec baseline.

Applications

The accuracy of a zero-shot model won’t match that of a classifier trained on a particular dataset, but the results above show that it is still very capable, making it a viable alternative for some tasks, depending on the type of text.

Zero-shot classification doesn’t work as well when the topic is a more abstract term in relation to the text, which is understandable given how the model was pre-trained.

Some tasks may require a higher level of performance, in which case a trained classifier will always be the preferred option; however, obtaining the training data can be difficult, time-consuming and/or expensive. You could use a zero-shot model on a large unrefined dataset, such as news stories or social media posts from a firehose-style API, as a way of filtering down or pre-labelling texts before manually refining the dataset. The refined dataset can then be used to train a fully supervised classifier.
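One way to sketch that filtering step (the threshold and the classifier interface are assumptions; any callable returning per-label scores would do, including the Hugging Face pipeline wrapped to match):

```python
def filter_by_label(texts, classify, label, threshold=0.8):
    """Keep texts the zero-shot classifier scores above `threshold` for `label`.
    `classify(text, labels)` is assumed to return a {label: score} dict."""
    kept = []
    for text in texts:
        scores = classify(text, [label])
        if scores.get(label, 0.0) >= threshold:
            kept.append(text)
    return kept

# Stub classifier for illustration only: flags texts containing the label word.
def stub_classify(text, labels):
    return {lab: (0.9 if lab in text.lower() else 0.1) for lab in labels}

headlines = ["Finance markets surge", "Local team wins final"]
filtered = filter_by_label(headlines, stub_classify, "finance")
# Only the finance-related headline survives the filter.
```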

Zero-shot classification is definitely worth exploring and keeping an eye on, particularly as people develop additional capabilities and applications such as few-shot learning (refining a pre-trained model with a limited number of training samples) and zero-shot image classification.
