Things to consider while choosing a vector database
Jayden Luo, Software Engineer
June 22, 2023
Last week we discussed the approach to Large Language Model (LLM) tooling that Peritus has landed on, focusing on some valuable insights (we think!) about the need for engineering from the ground up. We mentioned there that vector databases are an exception, with abundant options available that thoroughly cover the use cases most organizations will encounter.
Vector databases are data storage technologies designed to store and search high-dimensional data such as LLM embeddings. These embeddings are vector representations of the semantics of text: which entities, meanings, and relationships are mentioned, and what tone and sentiment the text communicates about them. This rich semantic information, along with the capacity to store and search it effectively, is central to many of the powerful applications of LLMs. In most vector databases, search is implemented using Approximate Nearest Neighbor (ANN) algorithms, meaning that you are not guaranteed to get all the true nearest neighbors, so the choice of ANN algorithm is an important factor when considering search quality.
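To make "true nearest neighbors" concrete, here is a minimal sketch of the exact, brute-force search that ANN algorithms approximate. The vectors, ids, and function names are made up for illustration, and cosine similarity is an assumed choice of metric; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_nearest_neighbors(query, vectors, k):
    """Return the ids of the k stored vectors most similar to the query.

    Every stored vector is compared with the query, so the cost grows
    linearly with the size of the collection.
    """
    ranked = sorted(
        vectors,
        key=lambda vec_id: cosine_similarity(query, vectors[vec_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional "embeddings" (hypothetical data).
vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
}

top2 = exact_nearest_neighbors([1.0, 0.05, 0.0], vectors, k=2)
print(top2)  # ['doc_a', 'doc_b']
```

An exact scan like this always returns the true neighbors but becomes slow on large collections. ANN indexes give up that guarantee in exchange for speed.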
Before we mention storing anything, it’s worth highlighting that more and more high-quality embeddings suitable for semantic similarity are becoming available to the public. A list of state-of-the-art embeddings for various tasks is maintained on the Massive Text Embedding Benchmark (MTEB) Leaderboard on Hugging Face. At Peritus, we’ve found that OpenAI embeddings are great for semantic similarity out of the box and have decided to integrate them into our recommendation system.
While choosing a vector database for our use case, we discovered that, despite the abundant options, vector databases often differ substantially in the features they provide, which makes each one well suited to some use cases and poorly suited to others. In this blog post, we share some of the important aspects that we considered while choosing a vector database:
- Hosting: self-hosted vs managed
- Features: pure vector search vs filtered search vs reranking
- Performance: search quality, latency and throughput
Managed vector databases are a good option if you don’t have time for infrastructure setup or algorithm tuning. However, a managed service is also less configurable in terms of deployment and is usually more expensive. Pinecone, for example, is a managed-only vector database. It uses a proprietary ANN algorithm and offers only three hardware (pod type) options, with four sizes for each pod type. Self-hosting, on the other hand, provides more control but requires you to configure your deployment to meet your required uptime, query latency, throughput, and so on. For our deployments, control over configuration, latency, and cost is an important consideration, so we chose self-hosted over managed.
In some applications, such as managing data for multiple customers, filtered search is required. Although most vector databases support filtered search out of the box, they support different sets of filter field types. For example, Milvus does not yet support storing `list` fields, while Qdrant supports substring, geo radius, and geo bounding box filters. In addition, different vector databases implement filtering differently, with consequences for performance that we'll discuss later in the Performance section. If your use case requires a combination of vector search and other ranking algorithms such as BM25, you should instead consider a general-purpose search platform that has dense vector support. Examples include Apache Solr and Vespa.
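As a rough illustration of why the filtering strategy matters, the sketch below contrasts two generic approaches: filtering before ranking and filtering after ranking. It is a toy model with hypothetical data and function names, and it does not show how Qdrant, Milvus, or any other database actually implements filtering.

```python
def dot(a, b):
    """Dot product, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, items, k):
    """Rank items by similarity to the query and keep the best k."""
    return sorted(items, key=lambda it: dot(query, it["vector"]), reverse=True)[:k]


def pre_filter_search(query, items, k, predicate):
    """Apply the metadata filter first, then rank only the matching items."""
    return top_k(query, [it for it in items if predicate(it)], k)


def post_filter_search(query, items, k, predicate):
    """Rank everything first, then drop results that fail the filter."""
    return [it for it in top_k(query, items, k) if predicate(it)]


def is_acme(item):
    """Hypothetical per-customer filter."""
    return item["customer"] == "acme"


# Hypothetical multi-customer collection with 2-dimensional vectors.
items = [
    {"id": 1, "customer": "acme", "vector": [0.9, 0.1]},
    {"id": 2, "customer": "globex", "vector": [1.0, 0.0]},
    {"id": 3, "customer": "globex", "vector": [0.8, 0.2]},
    {"id": 4, "customer": "acme", "vector": [0.1, 0.9]},
]
query = [1.0, 0.0]

pre_ids = [it["id"] for it in pre_filter_search(query, items, 2, is_acme)]
post_ids = [it["id"] for it in post_filter_search(query, items, 2, is_acme)]
print(pre_ids)   # [1, 4]
print(post_ids)  # [1]
```

In this toy model, post-filtering returns fewer than the requested two results because one of the top-ranked items belongs to another customer. Pre-filtering returns a full result set but has to evaluate the filter across the collection first.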
If you need none of these filtering and reranking capabilities, you might instead consider a general-purpose database that has added vector search support, such as Redis or PostgreSQL. These databases are broadly adopted, so your team may already know them well, which lowers the learning curve. They can also be used for other purposes, and you might already have one deployed.
Generally, the search quality of a vector database is measured by recall: the fraction of true nearest neighbors found. In most ANN algorithms, however, throughput (queries per second) and recall trade off against each other. ANN-benchmarks is a great tool for comparing vector databases: it graphs this trade-off curve and already contains benchmark results on many classic datasets. Note that it is still wise to run your own benchmark using the ANN-benchmarks tool, because results can vary significantly across embeddings, query types, and hardware. As an example, Qdrant, Milvus and Weaviate all implement filtered search differently, so adding a filter affects the performance of each very differently.
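The recall definition above can be written down in a few lines. This is a minimal sketch with made-up ids and a hypothetical function name; it is not the ANN-benchmarks implementation, which also handles details such as distance ties.

```python
def recall_at_k(approx_ids, true_ids):
    """Fraction of the true nearest neighbors that the ANN search returned."""
    true_set = set(true_ids)
    if not true_set:
        raise ValueError("true_ids must not be empty")
    return len(true_set & set(approx_ids)) / len(true_set)


# Hypothetical ground truth (from an exact search) and ANN output for k = 4.
true_neighbors = ["d1", "d2", "d3", "d4"]
ann_result = ["d1", "d3", "d7", "d4"]

recall = recall_at_k(ann_result, true_neighbors)
print(recall)  # 0.75
```

In a benchmark, this value would be averaged over many queries and plotted against queries per second to produce the trade-off curve.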
Bonus: highlighted features and limitations
Below are features and limitations that we highlighted for a few vector databases.
Next week we’ll zoom out and look at where the modern move to large language models found its origin, how modern technologies compare to what came before, and contextualize where they might go.