Three mistakes when introducing embeddings and vector search

Jo Kristian Bergum
Apr 14, 2023

Representing unstructured data as embedding vectors and embedding-based retrieval (EBR) using vector search is more popular than ever. What are embeddings anyway? Roy Keyes explains it well in The shortest definition of embeddings?

Embeddings are learned transformations to make data more useful

In academia, this process is known as representation learning and has been a field of research for decades. By transforming data into vectors, a language native to computers, we can make the data more useful. Take BERT for text as an example: Bidirectional Encoder Representations from Transformers.

How useful the representation is depends on how we learn this transformation and how well the learned representation generalizes to new data. This is how we do machine learning: take some data, learn something from it, then apply that learning to new data. Simple.

So what is new? Why the surge in interest? The answer is better model architectures (e.g., the Transformer architecture) and self-supervised representation learning. Add a touch of confusion around Large Language Models (LLMs) such as ChatGPT to the mix, and here we are.

What about self-supervised learning? With a clever training objective, we can train a model on piles of data without human supervision (labeling). Once that is done, we can fine-tune the model for specific tasks, and this fine-tuning requires far less labeled data than training from scratch would.

This kind of learning pipeline is called transfer learning. Learning to snowboard also transfers to skateboarding, windsurfing, surfing, and other fun activities.

Transforming any data into useful representations

To keep this blog post short, let us focus on text models, and BERT models specifically. How can we transform data into useful embedding representations using Transformer-based models?

A simple BERT model. Input length is limited to 512 words (technically wordpieces) due to the quadratic computational complexity of self-attention.

BERT is a deep neural network model with weights, layers, and whatnot, complexity we hide inside the box. If we pull down the model from Hugging Face, the model weights have been set by pre-training with a masked language model objective.

We can take some text and tokenize it against a fixed vocabulary to obtain a sequence of numeric ids: a mapping between free text and hard-coded identifiers. The vocabulary size depends on the language, but for the vanilla English BERT model it is around 30K wordpieces. Unknown words (out of vocabulary) are assigned to a specially reserved UNK identifier. All unknown words share this identifier, so the model cannot differentiate between “foo” and “bar” if neither is in the vocabulary.
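
For illustration, here is a minimal tokenization sketch using the Hugging Face transformers library; the library and the bert-base-uncased checkpoint are my choices here, not something prescribed by this post.

```python
from transformers import AutoTokenizer

# Load the tokenizer and vocabulary of a vanilla English BERT model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.vocab_size)  # roughly 30K wordpieces

# Tokenize free text into wordpieces and map them to numeric ids
tokens = tokenizer.tokenize("Embeddings are learned transformations")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens, ids)

# Anything outside the vocabulary maps to the reserved UNK token and id
print(tokenizer.unk_token, tokenizer.unk_token_id)
```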

The BERT model can take a maximum of 512 words (the input context length limitation), and the network output is 512 vectors of dimensionality N, where N depends on the model variant. A vanilla BERT base model uses 768 dimensions. For an input of 512 words, we obtain a matrix of 512 x 768 floats, one 768-dimensional vector per input word. Unlike earlier NLP models such as Word2vec, each output word vector is contextualized by the attention mechanism in the Transformer architecture: the vector representation of a single word depends on all the other words in the input.

Now we have multiple vectors representing a single text; what do we do if we want to represent a chunk of text, a passage, or a paragraph in a single vector? One approach is to choose a single output vector (for example, the vector for the special CLS token) as the representation and ignore the rest. Another approach is pooling: for example, average pooling averages the 512 output vectors into a single vector representation.
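
A minimal sketch of average pooling with the transformers library; the model name and the masking details are illustrative choices, not prescribed by this post.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["A chunk of text we want to embed"],
                  padding=True, truncation=True, max_length=512,
                  return_tensors="pt")

with torch.no_grad():
    # One contextualized 768-dimensional vector per input wordpiece
    token_vectors = model(**batch).last_hidden_state  # (1, seq_len, 768)

# Average pooling: mask out padding tokens, then average over the sequence
mask = batch["attention_mask"].unsqueeze(-1)              # (1, seq_len, 1)
embedding = (token_vectors * mask).sum(1) / mask.sum(1)   # (1, 768)
print(embedding.shape)
```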

Now we have an embedding representation of a text chunk, which leads to mistake number 1.

Mistake #1: Using pre-trained models without task-specific fine-tuning

Using the direct vector representations from a model that has only been pre-trained will not produce a useful embedding representation for any task. Search ranking is an example of such a task; see the details in How not to use BERT for search ranking.

Encoding free-text queries and documents and expecting the cosine similarity between the two representations to rank the documents by relevance is naive, and that approach gives you next to random ranking results. Your learned snowboarding skills do not transfer to playing golf or swimming.
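
To make the naive approach concrete, here is a sketch that ranks documents by the cosine similarity of their embeddings to a query embedding; the embeddings are assumed to come from a mean-pooling step like the one sketched earlier, and nothing here is specific to this post.

```python
import torch

def rank_by_cosine(query_embedding, document_embeddings):
    """Rank documents by cosine similarity to the query, highest first."""
    # query_embedding: (1, dim), document_embeddings: (num_docs, dim)
    scores = torch.nn.functional.cosine_similarity(
        query_embedding, document_embeddings, dim=1)
    return scores.argsort(descending=True), scores

# With embeddings from a purely pre-trained BERT model, this ordering
# is close to random with respect to relevance; fine-tuning is needed.
```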

Mistake #2: Using fine-tuned single-vector embedding models out-of-domain

To obtain a useful embedding representation (better than random) for search ranking, we need to tune the model weights. We can do that by using a different objective when training the model: we train (update the weights) on labeled examples, such as relevant and irrelevant documents for a large sample of queries. MS MARCO is a large web search relevance collection with labeled query-document pairs, which can be used to train a ranking model.

This fine-tuning creates useful embedding representations based on BERT and outcompetes traditional keyword search methods with no learnable parameters, such as BM25, by a very large margin on the MS MARCO dataset.
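
As a sketch of what that looks like in practice, here is an example using the sentence-transformers library with a single-vector model fine-tuned on MS MARCO; the specific checkpoint name is illustrative, and neither the library nor the model is prescribed by this post.

```python
from sentence_transformers import SentenceTransformer, util

# Model name is illustrative: any single-vector model fine-tuned on MS MARCO
model = SentenceTransformer("msmarco-distilbert-base-v4")

query_embedding = model.encode("what is transfer learning",
                               convert_to_tensor=True)
doc_embeddings = model.encode([
    "Transfer learning reuses a model trained on one task for another task.",
    "BM25 is a bag-of-words ranking function based on term statistics.",
], convert_to_tensor=True)

scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)  # higher score = more relevant according to the model
```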

The problem is that when we take a single vector representation model, fine-tuned on MS MARCO labels, it does not beat BM25 in a different domain with slightly different types of documents and questions.

The BEIR Benchmark is an excellent framework for evaluating the quality of models trained on MS MARCO and how well they transfer to different domains and tasks.

We studied the effectiveness of ten different retrieval models and demonstrate that in-domain performance cannot predict how well an approach will generalize in a zero-shot setup. Many approaches that outperform BM25 in an in-domain evaluation on MS MARCO, perform poorly on the BEIR datasets.

I’ve written about zero-shot ranking and some solutions here, here, and here. Multi-vector representation models for search, like ColBERT, generalize much better than single-vector representations.

Mistake #3: Lack of understanding of vector search tradeoffs

So you made it here and have useful embedding representations of your data. Now you need a way to search the vector data using nearest neighbor search, also known as k-nearest neighbors (KNN) search, so you can deploy your exciting use case to production.

The first thing you should ask yourself is: do we need to introduce approximate nearest neighbor search (ANNS) instead of exact nearest neighbor search? As in many aspects of life, this is a question of tradeoffs.

Consider the query serving side alone, even before accounting for document-side processing complexity such as the need for CRUD operations, real-time versus batch indexing, and so on:

  • Latency Service Level Agreement (SLA): how fast do we need it to be? 0.001ms, 1ms, 10ms, 100ms, a second? Maybe 3 seconds is fine?
  • Throughput: what query throughput do we expect, and what is the maximum we can anticipate? 1 QPS, maybe 1M QPS, or even billions?
  • Accuracy impact: how much accuracy loss can we tolerate for our use case? When we introduce approximate search, there is an accuracy loss compared to exhaustive search, and depending on the choice of algorithm, it might be substantial. We can evaluate this impact by running queries with exact search, comparing the approximate output, and calculating the overlap between the two. For example, with k=10 we can measure overlap@10, or with k=1, overlap@1 (see the sketch after this list).
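
A minimal sketch of that measurement, where exact_search and ann_search are hypothetical functions returning ranked document ids:

```python
def overlap_at_k(exact_ids, approx_ids, k=10):
    """Fraction of the true top-k neighbors that the ANN search also returned."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k

# exact_ids = exact_search(query_vector, k=10)   # ground truth (brute force)
# approx_ids = ann_search(query_vector, k=10)    # approximate search
# overlap_at_k(exact_ids, approx_ids, k=10)      # overlap@10
```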

Given the above, it comes down to production deployment cost; how many servers do we need, or do we need servers at all?

Let us expand on the accuracy error tolerance and why that is use-case dependent. If you are building an image search service with over a billion photo vectors, you don’t necessarily need perfect recall. There are many equally great cat photos, and bringing back the exact best cats as deemed most relevant by the model might not be that important.

On the other hand, if you are building a retina image scan app using vector search to determine if the user can access the building, you had better have great overlap@1. In academic research on ANN algorithms, there is a clear distinction between these extremes: high-recall and low-recall settings.

An exhaustive search might be all you need

Exact nearest neighbor search brute-force calculates the distance between the query vector and all eligible document vectors, and the k documents returned are the true nearest neighbors. The search can be parallelized and multi-threaded, and in many cases it can use optimized hardware instructions; vectors are the machine's native language. The search can also be efficiently limited to a subset of documents if we store the vectors in an engine with query-time filtering capabilities.

For example, brute-force searching 1M vectors with 128 dimensions takes about 100ms single-threaded. We can parallelize the search; for example, using four threads gets it down to about 25ms, until we hit memory bandwidth limits. If we page the vector data randomly from disk, it will be slower but still parallelizable. If we have 10B vectors and no way to efficiently select a subset of documents to perform the nearest neighbor search over, we have a cost problem. We can still get decent latency by distributing the search over multiple nodes in parallel, as Vespa can do, but renting servers to keep the latency in check can become costly with billions of embeddings. Add high query throughput to the mix, and we have a real cost problem.
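
A minimal single-threaded brute-force sketch in NumPy for the 1M x 128 scenario above, using random data and Euclidean distance; the numbers are illustrative.

```python
import numpy as np

num_vectors, dim, k = 1_000_000, 128, 10
vectors = np.random.rand(num_vectors, dim).astype(np.float32)
query = np.random.rand(dim).astype(np.float32)

# Compute the squared Euclidean distance from the query to every vector
distances = ((vectors - query) ** 2).sum(axis=1)

# The k smallest distances are the true nearest neighbors
nearest = np.argpartition(distances, k)[:k]
nearest = nearest[np.argsort(distances[nearest])]
print(nearest)
```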

Introducing approximations

Going down the approximate vector search route, we need an algorithm that can index the vector data so that searches are less costly than exhaustive search, at the price of resource usage and index processing. Here, too, there are many tradeoffs, such as disk usage, memory usage, and how well the algorithm supports real-time CRUD operations. One good source for understanding ANN algorithms is https://github.com/erikbern/ann-benchmarks, where different algorithms and implementations are compared on various vector datasets.
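
As one concrete example, here is a sketch using the open-source hnswlib library, one of the graph-based implementations benchmarked there; the library choice and parameters are mine, not the post's.

```python
import numpy as np
import hnswlib

dim, num_vectors = 128, 100_000          # sizes are illustrative
vectors = np.random.rand(num_vectors, dim).astype(np.float32)

# Build an HNSW index; M and ef_construction trade indexing cost for recall
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_vectors, ef_construction=200, M=16)
index.add_items(vectors, np.arange(num_vectors))

# ef controls the recall/latency tradeoff at query time
index.set_ef(100)
labels, distances = index.knn_query(vectors[:1], k=10)
print(labels)
```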

The ann-benchmarks plot for the SIFT dataset, with 1M 128-dimensional vectors, displays recall@10 (same as overlap@10) versus queries per second. The benchmark is single-threaded, which means that if an algorithm reaches 10² QPS, we have a latency of 10ms; 10³ QPS means a latency of 1ms, and so forth. These algorithms are pretty damn fast.

If we deploy these algorithms on a server with multiple CPU cores, we can enjoy even more QPS: two cores are expected to give 2x the QPS, as long as there are no contention or locking scaling problems. But not all ANN algorithms give us equally good recall. Algorithms up and to the right in the plot give the best tradeoff between performance and accuracy, while the lower-left quadrant is worse. Some algorithms struggle to get past 50% recall.

What is not reflected in these benchmarks is the cost of indexing and whether the algorithm supports updates and CRUD operations. Some are batch-oriented: they need a large sample of the document vectors before they can build the index, while others can build the index incrementally. Note that ann-benchmarks can only include open-source algorithms, since results must be reproducible on the same runtime. Some commercial and proprietary vector search vendors have unknown recall versus performance tradeoffs.

Summary

If you hated this post, you could shout out to me over at Twitter https://twitter.com/jobergum.
