A vector database is a specialised storage and retrieval system built to store high-dimensional numerical vectors — called embeddings — and run fast similarity searches across them. It is not a relational database and it is not a search engine. Relational databases are optimised for exact match and structured queries. Search engines like Elasticsearch are optimised for keyword overlap. Vector databases are optimised for semantic similarity: finding items that are conceptually close, even when the words are completely different.
This distinction matters in production. If you are building a RAG system, a semantic search feature, a recommendation engine, or any application where meaning matters more than exact text match, you need a system designed for this retrieval pattern. A Postgres table with a LIKE query will not do it.
Why Vector Databases Exist
The explosion of embedding models created the problem that vector databases solve. Models like OpenAI's text-embedding-3-large, Cohere Embed, and Google's text-embedding-004 can represent text, images, audio, and structured data as dense numerical vectors — typically between 768 and 3072 dimensions. Two pieces of text with similar meaning will have vectors that are close together in this high-dimensional space, even if they share no words.
Finding similar items means finding vectors that are close. The challenge is doing this at scale, across millions or hundreds of millions of vectors, in milliseconds. That is computationally expensive with naive approaches. Vector databases solve this with purpose-built indexing algorithms.
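To make the cost concrete, here is a minimal sketch of the naive approach that ANN indexes replace: exact brute-force nearest-neighbour search over random vectors (the data here is synthetic, purely for illustration).

```python
import numpy as np

def brute_force_search(index: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Exact k-nearest-neighbour search by cosine similarity.

    Scans every stored vector: O(N * d) work per query, which is what
    becomes infeasible at hundreds of millions of vectors -- and what
    ANN indexes like HNSW avoid by navigating a graph instead.
    """
    # Normalise so a plain dot product equals cosine similarity.
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = index_n @ query_n          # one similarity score per stored vector
    return np.argsort(-scores)[:k]      # indices of the k best matches

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 128))
query = vectors[42] + 0.01 * rng.standard_normal(128)  # near-duplicate of vector 42
top = brute_force_search(vectors, query, k=5)
# vector 42 should come back as the closest match
```

Every query touches every vector, so latency grows linearly with the dataset. HNSW trades a small amount of recall for sub-linear query cost.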
How They Work: ANN Indexing
Vector databases index vectors using Approximate Nearest Neighbour (ANN) algorithms. The most widely used is HNSW — Hierarchical Navigable Small World — originally described in the 2016 paper at arxiv.org/abs/1603.09320. HNSW builds a multi-layer graph where each layer is a sparser representation of the dataset. At query time, the algorithm navigates from the top layer down to find approximate nearest neighbours efficiently. It offers a strong tradeoff between search speed and recall.
The alternative is IVF (Inverted File Index), which clusters vectors and searches only the relevant clusters at query time. IVF uses less memory but is generally less accurate than HNSW at equivalent speed settings. For most production RAG and search use cases, HNSW is the right default.
At query time, the process is: embed the query using the same model used to embed your documents, send the query vector to the database, and receive the k most similar vectors, ranked by cosine similarity or dot product.
Cosine similarity vs dot product: For normalised text embeddings (standard for most embedding models), cosine similarity and dot product produce the same ranking. Cosine similarity is the safer default if you are unsure whether your embeddings are normalised.
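The equivalence for normalised vectors is easy to verify directly. This sketch (synthetic random vectors standing in for real embeddings) normalises a document set to unit length and shows that dot product and cosine similarity produce the same ranking:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
docs = rng.standard_normal((100, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit length, as most text embedding models return
query = rng.standard_normal(64)
query /= np.linalg.norm(query)

dot_ranking = np.argsort(-(docs @ query))
cos_ranking = np.argsort(-np.array([cosine_sim(d, query) for d in docs]))
# for unit-length vectors, both orderings are identical
```

For unit-length vectors the norms in the cosine denominator are all 1, so the two scores are equal and the rankings match. If your vectors are not normalised, the rankings can diverge, which is why cosine is the safer default.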
The Vector Database Landscape
Pinecone
Fully managed, serverless and pod-based tiers. Strong developer experience and production track record. Used widely for production RAG at scale. The serverless tier scales to zero cost when idle — useful for variable workloads. The tradeoff: proprietary with no self-hosted option. If data sovereignty or vendor lock-in is a concern, Pinecone is not the right choice. Docs: docs.pinecone.io.
Weaviate
Open-source with a managed cloud offering. Strongest in hybrid search — combining vector similarity with keyword (BM25) scoring in a single query. If your use case requires mixing semantic and keyword retrieval (common in enterprise search), Weaviate handles this natively and well. Good support for multi-modal data. Docs: weaviate.io/developers/weaviate.
Qdrant
Open-source, high performance, excellent filtering capabilities. Qdrant's payload-based filtering is one of the more robust implementations, allowing pre-filtering by metadata before nearest-neighbour search. Can be self-hosted or used as a managed cloud service. Strong choice for teams with self-hosting requirements. Docs: qdrant.tech/documentation.
Chroma
Open-source, simple API, excellent for local development and prototyping. The easiest vector database to get running in an afternoon. Not the right choice for production systems at scale — less battle-tested under high load than Pinecone, Weaviate, or Qdrant. Use Chroma to prototype; migrate to a production-grade option before going live at scale.
pgvector
A PostgreSQL extension that adds vector storage and similarity search to an existing Postgres database. If your stack already runs Postgres, pgvector adds vector search without introducing a new database system to operate. Dramatically lower operational overhead. The scale ceiling is real: pgvector handles up to roughly 1–5 million vectors well with HNSW indexing. Beyond that, query latency starts to degrade relative to purpose-built vector databases. GitHub: github.com/pgvector/pgvector.
Redis Vector Search
In-memory, ultra-low latency. Good for real-time similarity search at lower vector counts where sub-millisecond response is required. Best suited for use cases with a bounded dataset size that fits comfortably in memory — session-level personalisation, real-time feature lookups, online recommendation at limited scale.
Key Metrics
- Recall: What fraction of the true nearest neighbours are actually returned. HNSW with default settings typically achieves 95–99% recall. Lowering recall improves speed; know your acceptable floor before tuning.
- Latency: P50 and P95 query latency. For interactive RAG applications, P95 under 100ms for vector retrieval is a reasonable target. Monitor P95, not just average.
- Throughput: Queries per second at your target recall and latency. Test at realistic concurrency before going to production.
- Index build time: Relevant when you need to ingest large volumes quickly or rebuild the index frequently.
- Memory per million vectors: At 1536 dimensions (OpenAI text-embedding-3-small), HNSW typically requires 6–10GB per million vectors.
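A back-of-envelope estimate shows where the figures above come from. The overhead model here is a rough assumption (float32 storage plus a crude allowance for HNSW graph links), not a vendor's sizing formula:

```python
def hnsw_memory_gb(n_vectors: int, dims: int, m: int = 16) -> float:
    """Rough HNSW memory estimate.

    Assumptions (illustrative, not a vendor sizing formula):
    - float32 vector storage: 4 bytes per dimension
    - graph overhead: ~m neighbour links per vector across layers,
      doubled as a crude allowance, 4-byte ids
    """
    raw = n_vectors * dims * 4       # vector data
    links = n_vectors * m * 2 * 4    # approximate graph link overhead
    return (raw + links) / 1e9

one_million = hnsw_memory_gb(1_000_000, 1536)
# ~6.3 GB -- the low end of the 6-10 GB/million range, before
# metadata, replicas, and index bookkeeping are added
```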
Critical Design Decisions
Dimensions: Higher dimensions capture more semantic nuance but cost more memory and make search slower. OpenAI's text-embedding-3-small at 1536 dimensions is a practical balance for most text RAG use cases. text-embedding-3-large at 3072 dimensions improves retrieval quality measurably but roughly doubles memory requirements.
Distance metric: Choose when you set up the index — you cannot change it without rebuilding. Cosine similarity for text embeddings. Dot product for embeddings you know are normalised. Euclidean (L2) for image or audio embeddings where absolute magnitude carries information.
Filtering: Metadata filtering lets you restrict similarity search to a subset of your data. Pre-filtering — applying the filter before running ANN search — is safer and more accurate than post-filtering. Post-filtering runs similarity search first and then removes results that don't match the filter, which can return fewer than k results if many are filtered out.
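The post-filtering shortfall is easy to demonstrate. This toy example (hypothetical doc ids, scores, and a `lang` metadata field, all invented for illustration) shows the same filter applied before and after the top-k cut:

```python
# Results already sorted by similarity, as a vector database would return them.
results = [  # (doc_id, similarity, metadata)
    ("a", 0.95, {"lang": "de"}),
    ("b", 0.93, {"lang": "de"}),
    ("c", 0.91, {"lang": "en"}),
    ("d", 0.88, {"lang": "en"}),
    ("e", 0.85, {"lang": "en"}),
]
k = 3

# Post-filtering: take the top k FIRST, then filter -> can come up short.
post = [r for r in results[:k] if r[2]["lang"] == "en"]

# Pre-filtering: filter FIRST, then take the top k -> up to k matches.
pre = [r for r in results if r[2]["lang"] == "en"][:k]
# post has 1 result; pre has 3
```

The caller asked for k=3 English results; post-filtering returns one, because two of the top-3 slots were spent on documents the filter then discarded.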
Chunking strategy: Vector databases store what you put in them. Poor chunking — chunks that are too small to carry meaning, or too large to be retrieved precisely — degrades retrieval quality regardless of which vector database you use.
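One common baseline is fixed-size chunking with overlap, sketched below. The sizes here are illustrative, not tuned recommendations; real pipelines usually chunk on token or sentence boundaries rather than raw characters:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows.

    Overlap preserves context at chunk boundaries, so a sentence that
    straddles two chunks remains retrievable from either one.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 500, chunk_size=200, overlap=50)
# 500 chars with step 150 -> windows starting at 0, 150, and 300
```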
pgvector vs Purpose-Built: The Decision Threshold
Use pgvector if: you already run Postgres, your vector count is below 1–5 million, you do not have strict P95 latency requirements for vector retrieval, and minimising operational complexity is a priority.
Move to a purpose-built vector database if: your vector count exceeds 5 million; you need P95 retrieval under 50ms at high concurrency; you need advanced filtering, hybrid search, or multi-tenancy at scale; or you need a managed service with SLAs.
Common Failure Modes
Indexing stale embeddings: If the source document changes, the embedding in the database is outdated. Build a re-indexing pipeline that detects document updates, re-embeds the affected content, and upserts the new vectors. Without this, your RAG system silently returns outdated information.
Dimension mismatch: If you embed with one model at setup and later switch embedding models, stored vectors and new query vectors live in incompatible spaces. Similarity search returns garbage. Treat your embedding model as a versioned dependency — changing it requires full re-indexing.
Ignoring recall at query time: Running HNSW at default settings is fine for prototyping. In production, measure actual recall against a ground truth evaluation set. If recall is below 90%, tune ef (HNSW search parameter) upward before blaming retrieval quality on anything else.
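Measuring recall against ground truth is a few lines of code. In practice the ground-truth sets would come from exact brute-force search over an evaluation sample, and the retrieved sets from your ANN index at its current settings; the ids below are invented for illustration:

```python
def recall_at_k(true_ids: list[set], retrieved_ids: list[set]) -> float:
    """Fraction of true nearest neighbours actually returned,
    averaged over all evaluation queries."""
    per_query = [len(t & r) / len(t) for t, r in zip(true_ids, retrieved_ids)]
    return sum(per_query) / len(per_query)

truth = [{1, 2, 3, 4, 5}, {10, 11, 12, 13, 14}]      # from exact search
retrieved = [{1, 2, 3, 4, 9}, {10, 11, 12, 13, 14}]  # from the ANN index
score = recall_at_k(truth, retrieved)
# (4/5 + 5/5) / 2 = 0.9 -- below the 95-99% typical of well-tuned HNSW
```

If this number sits below your floor, raise ef and re-measure before touching anything else in the pipeline.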
No metadata filtering design: Building the schema without thinking about which metadata fields will be used as filters. Once the index is built, adding new filterable metadata fields requires re-ingestion. Design your metadata schema before first ingest.
References
- HNSW paper: arxiv.org/abs/1603.09320
- Pinecone docs: docs.pinecone.io
- Weaviate docs: weaviate.io/developers/weaviate
- Qdrant docs: qdrant.tech/documentation
- pgvector GitHub: github.com/pgvector/pgvector
Talk to an AI Implementation Expert
If you want help choosing a vector database for your specific use case or auditing your existing retrieval pipeline, book a working session.
Book a call: https://calendly.com/ai-creation-labs/30-minute-chatgpt-leads-discovery-call
During the call we can cover:
- vector database selection for your stack and scale
- retrieval architecture and chunking strategy
- embedding model selection and dimension tradeoffs
- recall evaluation and latency benchmarking