How not to use BERT for Document Ranking

Jo Kristian Bergum
Oct 14, 2020

BERT (Bidirectional Encoder Representations from Transformers) turned two years old a few days ago, and since its introduction it has revolutionized Search and Information Retrieval. It has drastically improved accuracy on many different information seeking tasks, be it answering questions or ranking documents, far beyond what was thought possible just a few years ago. In this blog post I'll give a quick overview of how to evaluate search ranking models using well established relevancy datasets, show how to achieve terrible ranking results by using BERT in a way it was not meant to be used, and give a few pointers on how to successfully apply BERT for ranking.

Document ranking is the task where documents or text passages are retrieved from a collection and ranked according to their relevance to the information need of a specific query or question formulated by a user. Several datasets are used to evaluate the effectiveness of retrieval and ranking models, and the MS Marco dataset is the largest relevancy dataset available in the public domain. MS Marco is split into several parts: one part can be used to train and develop a ranking model, and another part is held out for evaluation. Submissions are evaluated on the held-out set, compared against a baseline, and placed on a leaderboard so that researchers and others can compare methods. It also comes with bragging rights.

Below is a snapshot of the passage ranking leaderboard and the document ranking leaderboard at the time of writing this post:

Passage Ranking MS Marco Leaderboard https://microsoft.github.io/msmarco/
Document Ranking MS Marco Leaderboard https://microsoft.github.io/msmarco/

The best performing ranking models on the MS Marco relevancy collection are largely dominated by BERT, or Transformer models in general. The best models achieve an MRR score above 0.4 for passage ranking and close to 0.4 for document ranking. So what does MRR measure? MRR stands for Mean Reciprocal Rank, which sounds a bit scary but is in reality a very simple and intuitive metric for evaluating information retrieval systems and ranking models. RR (Reciprocal Rank) measures where in the ranked list of documents the first relevant document (as judged by a human) is found. If the relevant document is ranked at position 1, the RR score is 1.0; position 2 gives a score of 0.5, position 3 gives 0.33, and so on. The final MRR score for a ranking model is then simply the mean RR computed over all queries in the evaluation set, which gives the overall effectiveness of the ranking model.
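To make the metric concrete, here is a minimal sketch of how MRR could be computed in Python. The function names and the toy data are just illustrations; note also that the official MS Marco passage ranking metric is MRR@10, where only the top 10 results count.

```python
def reciprocal_rank(ranked_doc_ids, relevant_doc_ids):
    """Return 1/position of the first relevant document, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings, judgments):
    """Average the reciprocal rank over all queries in the evaluation set.

    rankings:  {query_id: [doc_id, ...]} as returned by the ranking model
    judgments: {query_id: {doc_id, ...}} human-judged relevant documents
    """
    scores = [reciprocal_rank(rankings[qid], judgments[qid]) for qid in rankings]
    return sum(scores) / len(scores)


# Toy example: the relevant document is ranked 1st for the first query (RR = 1.0)
# and 2nd for the second query (RR = 0.5), giving MRR = 0.75.
rankings = {"q1": ["d3", "d7"], "q2": ["d1", "d9"]}
judgments = {"q1": {"d3"}, "q2": {"d9"}}
print(mean_reciprocal_rank(rankings, judgments))  # 0.75
```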

That the metric is computed over the whole query set is a very important point which many miss when writing about ranking or implementing a search engine. With a large enough query set you can cherry-pick queries where your favourite ranking model A performs better than existing model B, and there is no shortage of blog posts which do exactly this. When evaluating whether a ranking model is better than a baseline or a different model, you should not trust cherry-picked queries; instead, see if the authors present results from evaluating the model on a well known relevancy collection and compare their claimed results with SOTA ranking methods.

Speaking of baselines, the baseline ranking model used in the MS Marco passage ranking task is term based BM25, which is the ranking model you get by default with open source search engines like ElasticSearch and SOLR, both based on the Lucene Java search library. The official BM25 baseline achieves an MRR score of 0.165 on the passage ranking task. So on average, with plain old term based BM25 the relevant document is found around position 6 (1/0.165), while a Transformer based model moves it to roughly position 2.5 (1/0.4). With Transformer based ranking we essentially move the relevant hit above the fold, which is increasingly important as more and more searches are done from devices with limited screen real estate.
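For reference, the BM25 baseline is simply what you get from a plain keyword query against one of these engines. Below is a minimal sketch against a local ElasticSearch instance; the index name ("msmarco") and field name ("text") are assumptions for the example.

```python
from elasticsearch import Elasticsearch

# Connect to a local ElasticSearch instance; the index and field names below
# are hypothetical -- adjust them to your own schema.
es = Elasticsearch()

response = es.search(
    index="msmarco",
    body={
        "size": 10,
        # A plain match query: ElasticSearch scores the hits with BM25 by default.
        "query": {"match": {"text": "what is the capital of norway"}},
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_id"])
```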

Remember, this is evaluated over a large query set and not just a few cherry-picked queries. Given the effectiveness of BERT models, and Transformers in general, as demonstrated on various ranking leaderboards, it is no surprise that the large web search engines use them in production. For instance, see Google’s blog post on understanding searches better than ever before and Bing delivers its largest improvement in search.

The Google and Bing announcements made it to the mainstream media, and at that time one could see a lot of interest in BERT for ranking. Blog posts started appearing, exemplified by these two: Semantic Search at Scale with BERT and ElasticSearch, and ElasticSearch meets BERT.

Most of these blog posts follow the same pattern: they use BERT as a representation model, where the question and the documents are encoded by BERT independently. Bert-as-a-service is often used, a library which takes input text (a question or a document) and maps it to a dense embedding representation using a pooling strategy over the output of the last BERT layer.
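A rough sketch of that representation pattern, here using the Hugging Face transformers library with mean pooling over the last layer rather than Bert-as-a-service itself (the model name, pooling choice, and example texts are assumptions for illustration):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Off-the-shelf pre-trained BERT, *not* fine-tuned for ranking -- exactly the
# setup that performs poorly on MS Marco.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")


def embed(texts):
    """Encode texts independently and mean-pool the last hidden layer."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, dim)


query_vec = embed(["what is the capital of norway"])
doc_vecs = embed(["Oslo is the capital of Norway.",
                  "BERT is a Transformer based language model."])

# Rank the documents by cosine similarity to the query embedding.
scores = torch.nn.functional.cosine_similarity(query_vec, doc_vecs)
print(scores)
```

Each text is encoded in isolation, so the only interaction between query and document is the final similarity computation, which, as we will see, is not enough without fine-tuning.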

Note that there is nothing wrong with Bert-as-a-service, but using BERT embeddings without fine-tuning the pre-trained model for the ranking task gives next to random results when evaluated on a well known relevancy collection like MS Marco. There is nothing wrong with ElasticSearch either, but its dense vector type does not support fast approximate nearest neighbor search, so retrieval is done by brute force computation over all documents.

So how would a retrieval system like the one described in the ElasticSearch meets BERT blog post perform when evaluated on the MS Marco dataset? Not so well, it turns out.

End of MS Marco Passage Ranking Leaderboard https://microsoft.github.io/msmarco/

As seen above, the BERT representation model gets an MRR score of 0.015 on the evaluation query set! In other words, the relevant document was on average found around position 66 (1/0.015), and putting this model into production with low latency using brute force dense retrieval in ES would cost orders of magnitude more than simple BM25. In a normal search interface you would have to paginate to page 6 or 7 before seeing the first relevant hit: a terrible user experience, at a high cost, and surely not state of the art. So what went wrong? It was using BERT, wasn’t it?

The issue was that the BERT model used was not fine-tuned for ranking. The BERT representation model submission above came from Microsoft Research, as part of their paper Understanding the Behaviors of BERT in Ranking [Qiao et al. ’19]. For exploration, the authors used the raw pre-trained BERT model as is, without fine-tuning it for ranking, applying it directly in a way the authors of the original BERT paper never suggested.

The paper also describes how to apply BERT with success, using fine-tuning and full interaction between the query and the candidate document.
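In that setup, often referred to as a cross-encoder, the query and the candidate document are fed to BERT together as one sequence so that every query term can attend to every document term, and a classification head fine-tuned on MS Marco labels produces the relevance score. A minimal sketch, where the checkpoint name is a placeholder for any BERT model actually fine-tuned for passage ranking:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder name: substitute a BERT checkpoint that has actually been
# fine-tuned for MS Marco passage ranking.
model_name = "your-org/bert-base-msmarco-reranker"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

query = "what is the capital of norway"
passages = ["Oslo is the capital of Norway.",
            "BERT is a Transformer based language model."]

# Full interaction: query and passage are encoded *together* as one sequence
# ([CLS] query [SEP] passage [SEP]), so attention flows between the two.
batch = tokenizer([query] * len(passages), passages,
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits

# Assuming class index 1 means "relevant" (check the checkpoint's config);
# a higher score means a more relevant passage.
scores = logits[:, 1].tolist()
for score, passage in sorted(zip(scores, passages), reverse=True):
    print(round(score, 3), passage)
```

Because every query-passage pair requires a full BERT forward pass, this model is typically used to re-rank a smaller candidate set retrieved by a cheaper first stage such as BM25.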

To see how to successfully use BERT for ranking, see the excellent survey Pretrained Transformers for Text Ranking: BERT and Beyond [Jimmy Lin et al.]. It was released yesterday (October 13, 2020) and is probably the best and most comprehensive resource available for understanding how to use BERT successfully for ranking.

Note that there are BERT representation models which also outcompete traditional term based BM25 ranking, but the representation needs to be trained for the ranking task. This is also covered in the mentioned survey. Dense Passage Retrieval for Open-Domain Question Answering [Karpukhin et al.] describes such a model, where the query and the document are encoded by two different BERT models which are trained jointly to represent queries and documents in the same embedding space.
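The DPR question and context encoders published by the paper’s authors are available through the transformers library; here is a minimal sketch of scoring a couple of passages against a question (the example texts are made up):

```python
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

# Two different BERT based encoders, trained jointly so that questions and
# passages land in the same embedding space.
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")

question = "what is the capital of norway"
passages = ["Oslo is the capital of Norway.",
            "BERT is a Transformer based language model."]

with torch.no_grad():
    q_emb = q_encoder(**q_tokenizer(question, return_tensors="pt")).pooler_output
    ctx_emb = ctx_encoder(**ctx_tokenizer(passages, padding=True, truncation=True,
                                          return_tensors="pt")).pooler_output

# DPR ranks passages by the dot product between question and passage embeddings.
scores = torch.matmul(q_emb, ctx_emb.T).squeeze(0)
print(scores)
```

Since the passages can be encoded offline, only the question needs to be encoded at query time, and retrieval becomes a nearest neighbor search in the shared embedding space.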

If you are interested in improving your search implementation using BERT for ranking, you should look at Vespa.ai, the open source big data serving engine. It offers fast dense embedding retrieval using approximate nearest neighbor search as well as BM25 (term based ranking), allowing hybrid retrieval. It also supports evaluation of BERT ranking models done the right way. In the blog post below the team describes how to implement the mentioned Dense Passage Retrieval paper with Vespa, achieving state of the art results for question answering using BERT.
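As a rough illustration only, a hybrid query against Vespa’s HTTP search API could look something like the sketch below. The document schema, field names, rank profile name, and the exact YQL annotation for the nearestNeighbor operator are assumptions here, so consult the Vespa documentation for the precise syntax.

```python
import requests

# Placeholder dense query vector -- in practice this comes from the question encoder.
query_embedding = [0.0] * 768

request_body = {
    # Retrieve candidates with approximate nearest neighbor search over a dense
    # "embedding" field OR with a regular keyword (BM25) match, and let the
    # rank profile combine the two signals. Annotation and field names are
    # assumptions for this sketch.
    "yql": 'select * from sources * where '
           '([{"targetHits": 100}]nearestNeighbor(embedding, query_embedding)) '
           'or userQuery();',
    "query": "what is the capital of norway",
    "ranking.features.query(query_embedding)": str(query_embedding),
    "ranking.profile": "hybrid",  # assumed rank profile mixing BM25 and closeness
    "hits": 10,
}
response = requests.post("http://localhost:8080/search/", json=request_body)
print(response.json())
```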
