
Distance Metrics

When you query a vector database, it ranks results by how “similar” the query vector is to each stored vector. But similarity is defined mathematically, and the definition you choose affects the quality of your results. VecLabs supports three distance metrics: cosine similarity, Euclidean distance, and dot product.

Cosine Similarity

Cosine similarity measures the angle between two vectors. It ignores magnitude entirely and only cares about direction.
cosine(A, B) = (A · B) / (|A| × |B|)
Range: -1 to 1 (VecLabs returns 0 to 1 after normalization)
  • Score of 1.0 = identical direction = maximum similarity
  • Score of 0.0 = perpendicular = no relationship
  • Score of -1.0 = opposite direction = maximum dissimilarity
Why it works for text: Embedding models like text-embedding-ada-002 produce vectors where the direction encodes meaning and the magnitude is largely noise. Two sentences can have different lengths and different word frequencies, producing vectors of different magnitudes - but if they mean the same thing, they’ll point in the same direction.
When to use cosine:
  • Text similarity and semantic search
  • Document retrieval and RAG
  • AI agent memory
  • Any task where you’re using a language model’s embeddings
Example:
const collection = sv.collection("my-docs", {
  dimensions: 1536,
  metric: "cosine", // default - best for most use cases
});
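For intuition, the cosine formula itself is easy to sketch in plain TypeScript. This is illustrative only; VecLabs computes the metric internally over your stored vectors.

```typescript
// cosine(A, B) = (A · B) / (|A| × |B|)
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

cosine([1, 0], [2, 0]);  // 1  - same direction; magnitude is ignored
cosine([1, 0], [0, 1]);  // 0  - perpendicular, no relationship
cosine([1, 0], [-1, 0]); // -1 - opposite direction
```

Note that scaling either vector does not change the score, which is exactly why cosine tolerates the magnitude noise in text embeddings.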

Euclidean Distance

Euclidean distance measures the straight-line distance between two points in vector space. It considers both direction and magnitude.
euclidean(A, B) = sqrt(sum((A_i - B_i)^2))
Range: 0 to infinity
  • Distance of 0 = identical vectors
  • Larger distance = less similar
VecLabs returns this as a similarity score (inverted), so higher is still better.
When to use euclidean:
  • Image embeddings where magnitude carries meaning
  • Audio or signal processing embeddings
  • Cases where your embedding model was specifically trained with euclidean distance
  • Recommendation systems based on collaborative filtering
When NOT to use euclidean:
  • Text embeddings from transformer models - cosine is almost always better for these
Example:
const collection = sv.collection("images", {
  dimensions: 512,
  metric: "euclidean",
});
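The Euclidean formula can be sketched the same way (again, purely for illustration; VecLabs handles the computation internally):

```typescript
// euclidean(A, B) = sqrt(sum((A_i - B_i)^2))
function euclidean(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

euclidean([0, 0], [3, 4]); // 5 - the classic 3-4-5 right triangle
euclidean([1, 1], [1, 1]); // 0 - identical vectors
```

Unlike cosine, scaling a vector changes the result: `[1, 0]` and `[2, 0]` point the same way but sit a distance of 1 apart, which is why magnitude-sensitive embeddings pair well with this metric.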

Dot Product

Dot product is the raw inner product of two vectors without normalization.
dot(A, B) = sum(A_i × B_i)
Range: Unbounded
  • Higher is more similar
  • Magnitude affects the score - larger vectors score higher
When to use dot product:
  • Models that explicitly recommend dot product (e.g., some Cohere models)
  • Maximum inner product search (MIPS) problems
  • Recommendation systems where popularity (magnitude) should boost scores
  • Cases where you want to reward longer, more information-rich vectors
Note: If your vectors are normalized to unit length (magnitude = 1), dot product and cosine similarity produce identical results. Many embedding models output normalized vectors.
Example:
const collection = sv.collection("recommendations", {
  dimensions: 256,
  metric: "dot",
});
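A short sketch makes both points above concrete: raw dot product rewards magnitude, and on unit-normalized vectors it collapses to cosine similarity. This is illustrative TypeScript, not VecLabs internals.

```typescript
// dot(A, B) = sum(A_i × B_i)
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

// Scale a vector to unit length (magnitude = 1).
function normalize(v: number[]): number[] {
  const mag = Math.sqrt(dot(v, v));
  return v.map((x) => x / mag);
}

// Magnitude affects the raw score:
dot([1, 2], [3, 4]); // 11
dot([2, 4], [3, 4]); // 22 - doubling one vector doubles the score

// After unit-normalization, dot product equals cosine similarity:
const a = normalize([1, 2]);
const b = normalize([3, 4]);
dot(a, b); // ≈ 0.9839, the cosine of the angle between [1, 2] and [3, 4]
```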

Quick decision guide

Are you using a language model's text embeddings?
  → Cosine similarity (almost always correct)

Are you working with images or audio?
  → Euclidean distance

Does your embedding model documentation specify a metric?
  → Use whatever it recommends

Are you building a recommendation system where popularity matters?
  → Dot product

Not sure?
  → Cosine similarity (safe default)

You cannot change metrics after collection creation

The distance metric is fixed at collection creation time. Changing it requires creating a new collection and re-indexing all your vectors. Choose carefully upfront.
Creating a collection with the wrong metric will produce silently incorrect search results - queries will return results, but they won’t be the right ones. Always verify your metric matches what your embedding model was trained with.

VecLabs raw benchmark numbers by metric

Measured on Apple M3, 384 dimensions, 1000 samples:
Metric        Raw computation time
Dot product   196 ns
Euclidean     212 ns
Cosine        303 ns
Cosine is slightly slower because it requires computing the magnitude of both vectors before dividing. The difference is negligible at query time - the HNSW graph traversal dominates overall latency, not the distance computation.