Vector Search Deep Dive
Overview
KillrVideo uses vector search to power semantic video recommendations. Instead of matching keywords, vector search understands the meaning of text — so a search for "action movies with car chases" can return videos about racing, heist films, and spy thrillers even if those exact words never appear in the video metadata.
This page explains how vector search works from first principles, how Astra DB implements it, and how KillrVideo uses it in the recommendations service.
What Are Vectors?
A vector is an ordered list of numbers. In the context of machine learning, vectors represent the "meaning" of text, images, or other data as a point in high-dimensional space.
Consider a simple example with 3 dimensions:
"action movie" → [0.8, 0.1, 0.3]
"thriller film" → [0.7, 0.2, 0.4]
"cooking video" → [0.1, 0.9, 0.1]
In this (simplified) space:
- "action movie" and "thriller film" are close together (similar meaning)
- "cooking video" is far away (different meaning)
Real-world embeddings use hundreds or thousands of dimensions to capture much more nuanced semantic relationships.
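The toy 3-dimensional example above can be checked numerically. This sketch measures straight-line (Euclidean) distance between the example points; the vectors are the illustrative ones from this page, not real embeddings:

```python
import math

action = [0.8, 0.1, 0.3]    # "action movie"
thriller = [0.7, 0.2, 0.4]  # "thriller film"
cooking = [0.1, 0.9, 0.1]   # "cooking video"

def euclidean(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two points in vector space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(action, thriller))  # ≈ 0.17: close, similar meaning
print(euclidean(action, cooking))   # ≈ 1.08: far apart, different meaning
```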
How Embeddings Work
An embedding model is a neural network trained to convert text (or other data) into vectors such that semantically similar content ends up geometrically close in the vector space.
The process:
```text
Text Input                Embedding Model         Vector Output
─────────────────────     ──────────────────      ──────────────────────────
"Learn to cook pasta"  →  NVIDIA NV-Embed-QA  →   [0.021, -0.043, 0.187, ...]
                                                  (4096 dimensions)
```
Training: The model learns from massive amounts of text, adjusting its internal weights so that sentences with similar meanings produce similar vectors. After training, the model is frozen and used purely for inference.
Why high dimensions matter: Each dimension captures a different "aspect" of meaning. With 4096 dimensions, the model can represent subtle distinctions — the difference between "how to cook pasta" (instructional) and "pasta restaurants in Rome" (informational) is encoded in specific dimensions.
The NVIDIA NV-Embed-QA Model
KillrVideo uses the NVIDIA NV-Embed-QA model for generating embeddings. Key characteristics:
| Property | Value |
|---|---|
| Model name | NV-Embed-QA |
| Embedding dimensions | 4096 |
| Input type | Text (up to ~512 tokens) |
| Optimized for | Question-answer retrieval tasks |
| Provider | NVIDIA API / Astra vectorize integration |
Why NV-Embed-QA:
- Optimized for asymmetric retrieval (short queries matching longer documents)
- High dimensional output (4096) captures fine-grained semantic distinctions
- Available directly through Astra DB's `vectorize` feature, with no separate embedding service needed
Asymmetric retrieval means the query and the document don't need to be the same length or format:
- Query: `"funny cat videos"` (short, natural language)
- Document: `"A compilation of hilarious moments featuring cats knocking things off tables, hiding in boxes, and reacting to cucumbers."` (long, descriptive)
The model generates vectors for both, and they end up close in the vector space even though they look very different as text.
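When calling a retrieval embedding API directly, this asymmetry is typically expressed by tagging each input as a query or a passage. The sketch below builds request payloads in the shape used later on this page; the `build_embed_request` helper is hypothetical, and the `"query"` input type is an assumption based on the `"passage"` value shown in the manual-embedding example:

```python
def build_embed_request(text: str, *, is_query: bool) -> dict:
    """Hypothetical helper: build an embeddings request body.

    Retrieval models distinguish short queries from longer documents
    via the input_type field ("query" vs. "passage") -- an assumption
    here, mirroring the request format shown later on this page.
    """
    return {
        "input": text,
        "model": "nvidia/nv-embed-qa-e5-v5",
        "input_type": "query" if is_query else "passage",
        "encoding_format": "float",
    }

query_req = build_embed_request("funny cat videos", is_query=True)
doc_req = build_embed_request(
    "A compilation of hilarious moments featuring cats knocking things "
    "off tables, hiding in boxes, and reacting to cucumbers.",
    is_query=False,
)
```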
Cosine Similarity
Once you have vectors, you need a way to measure how similar two vectors are. Cosine similarity is the standard metric for text embeddings.
Cosine similarity measures the angle between two vectors:
- 1.0 = identical direction (same meaning)
- 0.0 = perpendicular (unrelated meaning)
- -1.0 = opposite direction (opposite meaning)
The formula:
```text
similarity(A, B) = (A · B) / (|A| × |B|)
```
Where · is the dot product and |·| is the vector magnitude.
Why cosine over Euclidean distance:
- Cosine similarity normalizes for vector length, so short and long documents aren't penalized just for being different lengths
- Text embeddings tend to cluster on a hypersphere, making cosine a natural fit
- Well-supported in Astra DB and most vector databases
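The formula above translates directly into a few lines of Python. This is a plain-Python sketch for clarity; production code would typically use a vectorized library such as NumPy:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # similarity(A, B) = (A · B) / (|A| × |B|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0: identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0: perpendicular
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0: opposite direction
```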
Similarity thresholds in practice:
| Similarity | Interpretation |
|---|---|
| > 0.90 | Near-identical content |
| 0.75 – 0.90 | Strongly related topics |
| 0.60 – 0.75 | Related, same general domain |
| 0.40 – 0.60 | Loosely related |
| < 0.40 | Likely unrelated |
KillrVideo typically uses a threshold around 0.70 for recommendations — high enough to be relevant, low enough to surface useful variety.
Approximate Nearest Neighbor (ANN) Search
Finding the most similar vectors is conceptually a nearest-neighbor search: given a query vector, find the K vectors in the database closest to it.
Exact nearest neighbor is prohibitively slow at scale:
- Compare query vector against all N stored vectors
- O(N × D) time, where D = dimensions
- For 1 million videos × 4096 dimensions = 4 billion operations per query
Approximate Nearest Neighbor (ANN) trades a small amount of accuracy for massive speed gains using index structures (like HNSW — Hierarchical Navigable Small World graphs) that allow fast approximate search:
- Build an index structure at write time (slower writes, but done once)
- At query time, traverse the graph to find approximate nearest neighbors
- Typical result: 95–99% accuracy at 10–100x faster query speed
Astra DB uses HNSW under the hood for its SAI vector indexes.
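For contrast, here is what the exact O(N × D) baseline looks like: score every stored vector against the query, then keep the top K. The tiny two-dimensional catalog is made up for illustration; this is the brute-force approach that HNSW-style indexes avoid:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def exact_top_k(query: list[float], vectors: dict, k: int) -> list:
    """Exact nearest-neighbor search: one similarity computation per
    stored vector (O(N × D)), then a sort. Fine for toy data, far too
    slow for millions of high-dimensional rows."""
    scored = [(vid, cosine_similarity(query, v)) for vid, v in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

catalog = {
    "docker-intro": [0.9, 0.1],
    "k8s-basics":   [0.8, 0.3],
    "sourdough":    [0.1, 0.9],
}
print(exact_top_k([0.95, 0.2], catalog, k=2))
```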
Astra DB Vectorize Feature
The vectorize feature in Astra DB allows you to define an embedding model at the table level. When you insert text, Astra DB automatically generates the embedding vector. You don't call an embedding API separately.
Table Definition with Vectorize
```cql
CREATE TABLE killrvideo.video_vectors (
    videoid uuid PRIMARY KEY,
    title text,
    description text,
    tags set<text>,
    embedding vector<float, 4096>
) WITH vectorize = {
    'provider': 'nvidia',
    'model': 'NV-Embed-QA'
};
```
The `embedding` column is of type `vector<float, 4096>` — a list of 4096 floating-point numbers.
Automatic Embedding on Insert
With vectorize configured, inserting a video automatically generates its embedding:
```python
# No need to call NVIDIA API separately — Astra handles it
await video_vectors_table.insert_one({
    "videoid": str(video_id),
    "title": "Learning Python for Beginners",
    "description": "A comprehensive introduction to Python programming...",
    "tags": ["python", "programming", "tutorial"],
    "$vectorize": "Learning Python for Beginners. A comprehensive intro to Python..."
})
```
The `$vectorize` field tells Astra which text to embed. The `embedding` column is populated automatically by calling the NVIDIA NV-Embed-QA model.
Manual Embedding (Without Vectorize)
If you generate embeddings yourself (e.g., using the NVIDIA API directly), you insert the vector explicitly:
```python
import httpx

async def get_embedding(text: str) -> list[float]:
    # httpx.post() is synchronous; use an AsyncClient for awaitable requests
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://integrate.api.nvidia.com/v1/embeddings",
            json={
                "input": text,
                "model": "nvidia/nv-embed-qa-e5-v5",
                "input_type": "passage",
                "encoding_format": "float"
            },
            headers={"Authorization": f"Bearer {nvidia_api_key}"}
        )
        response.raise_for_status()
        return response.json()["data"][0]["embedding"]

# Insert with explicit vector
embedding = await get_embedding("Learning Python for Beginners...")
await video_vectors_table.insert_one({
    "videoid": str(video_id),
    "embedding": embedding  # 4096-element list
})
```
SAI Vector Indexes
Astra DB uses Storage-Attached Indexes (SAI), extended with vector capabilities, to power ANN search.
Creating a Vector Index
```cql
CREATE CUSTOM INDEX video_vectors_embedding_idx
ON killrvideo.video_vectors (embedding)
USING 'StorageAttachedIndex'
WITH OPTIONS = {
    'similarity_function': 'cosine'
};
```
Key options:
- `similarity_function`: `cosine` (recommended for text), `dot_product`, or `euclidean`
- The index builds the HNSW graph structure automatically
ANN Search Query
```cql
-- Find 10 videos most similar to a query vector
SELECT videoid, title, similarity_cosine(embedding, [0.021, -0.043, ...]) AS score
FROM killrvideo.video_vectors
ORDER BY embedding ANN OF [0.021, -0.043, ...]
LIMIT 10;
```
The `ANN OF` clause triggers the vector index. Cassandra traverses the HNSW graph to find approximate nearest neighbors without scanning every row.
Using the Data API
Through the Astra Data API (which KillrVideo uses), vector search looks like:
```python
async def find_similar_videos(query_vector: list[float], limit: int = 10):
    video_vectors_table = await get_table("video_vectors")
    results = await video_vectors_table.find(
        filter={},
        sort={"$vector": query_vector},
        limit=limit,
        projection={"videoid": True, "title": True, "$similarity": True}
    )
    return list(results)
```
The sort={"$vector": query_vector} tells the Data API to perform ANN search. The $similarity projection includes the cosine similarity score in the results.
Using Vectorize for Query Embedding
With vectorize, you can search using raw text instead of a pre-computed vector:
```python
async def find_similar_videos_by_text(query_text: str, limit: int = 10):
    video_vectors_table = await get_table("video_vectors")
    results = await video_vectors_table.find(
        filter={},
        sort={"$vectorize": query_text},  # Astra generates embedding automatically
        limit=limit,
        projection={"videoid": True, "title": True, "$similarity": True}
    )
    return list(results)
```
Astra DB calls the NVIDIA NV-Embed-QA model on `query_text`, then performs ANN search with the resulting vector, all in one API call.
Similarity Thresholds
Not all ANN results are equally relevant. Filtering by similarity score ensures only meaningful results are returned:
```python
MIN_SIMILARITY_THRESHOLD = 0.70

async def get_recommendations(video_id: str, limit: int = 10):
    # Get the video's embedding
    source_video = await video_vectors_table.find_one(
        filter={"videoid": video_id},
        projection={"embedding": True}
    )
    if not source_video or not source_video.get("embedding"):
        return []
    # Find similar videos
    results = await video_vectors_table.find(
        filter={"videoid": {"$ne": video_id}},  # Exclude self
        sort={"$vector": source_video["embedding"]},
        limit=limit * 2,  # Over-fetch to account for threshold filtering
        projection={"videoid": True, "title": True, "$similarity": True}
    )
    # Filter by minimum similarity
    recommendations = [
        r for r in results
        if r.get("$similarity", 0) >= MIN_SIMILARITY_THRESHOLD
    ]
    return recommendations[:limit]
```
Why over-fetch: ANN search returns the top-K results by similarity. If you apply a threshold filter after the fact, you might get fewer than K results. Over-fetching (e.g., 20 when you want 10) and then filtering gives you more buffer.
Choosing a threshold: Start at 0.70 and adjust based on result quality. Too high (0.90+) returns few results; too low (0.40) returns irrelevant content.
How KillrVideo Uses Vector Search
The Recommendations service uses vector search in two scenarios:
1. "More Like This" — Similar Videos
When viewing a video, the system finds semantically similar videos:
```text
Video being watched:
  "Introduction to Docker Containers"
  (embedding: [0.18, 0.31, ...])
          │
          ▼ ANN search
Similar videos returned:
  - "Docker for Beginners"            (similarity: 0.91)
  - "Kubernetes Fundamentals"         (similarity: 0.83)
  - "Container Orchestration Guide"   (similarity: 0.78)
  - "DevOps Workflow Tutorial"        (similarity: 0.71)
```
2. Semantic Search
When a user searches the catalog, their query is embedded and compared against video embeddings:
```text
Search query: "how to make sourdough bread"
  (embedded by NV-Embed-QA → 4096-dim vector)
          │
          ▼ ANN search against video catalog
Results:
  - "Sourdough Starter Guide"   (similarity: 0.89)
  - "Artisan Bread Baking"      (similarity: 0.81)
  - "Fermentation Basics"       (similarity: 0.74)
```
The search returns relevant videos even if the video titles don't contain the exact search terms.
Performance Characteristics
| Operation | Typical Latency | Notes |
|---|---|---|
| Generate embedding (NVIDIA API) | 50–200ms | Network call to NVIDIA |
| Generate embedding (via vectorize) | 50–200ms | Handled by Astra internally |
| ANN search (vector index, 1M rows) | 10–50ms | HNSW graph traversal |
| ANN search without index | 1,000–10,000ms | Full table scan — avoid |
| Insert with vectorize | 100–300ms | Includes embedding generation |
Key takeaway: The embedding generation step dominates latency. Once vectors are stored, search is fast even at scale.
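Since embedding generation dominates latency, a simple in-process cache for repeated query strings can pay off. A minimal sketch, assuming `embed_fn` stands in for any async embedding helper (such as the `get_embedding` function shown earlier on this page); real deployments would want eviction and size limits:

```python
from typing import Awaitable, Callable

# Naive unbounded cache: text -> embedding vector
_embedding_cache: dict[str, list[float]] = {}

async def get_embedding_cached(
    text: str,
    embed_fn: Callable[[str], Awaitable[list[float]]],
) -> list[float]:
    """Call the slow embedding function at most once per distinct text."""
    if text not in _embedding_cache:
        _embedding_cache[text] = await embed_fn(text)
    return _embedding_cache[text]
```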
Data Modeling Considerations
Storing Embeddings Alongside Metadata
One approach: add the embedding column to the primary videos table.
```cql
ALTER TABLE killrvideo.videos
ADD embedding vector<float, 4096>;
```
Trade-off: Embedding columns are large (4096 floats × 4 bytes = ~16KB per row). For tables with many rows, this increases storage and scan costs.
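The back-of-envelope arithmetic behind that trade-off:

```python
# Storage cost of one 4096-dimension float32 embedding column
DIMENSIONS = 4096
BYTES_PER_FLOAT = 4  # float32

bytes_per_row = DIMENSIONS * BYTES_PER_FLOAT
print(bytes_per_row)                    # 16384 bytes ≈ 16 KB per row
print(bytes_per_row * 1_000_000 / 1e9)  # ≈ 16.4 GB for 1 million videos
```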
Dedicated Vector Table
Another approach (used by KillrVideo): a separate video_vectors table with only the embedding and essential metadata.
```cql
CREATE TABLE killrvideo.video_vectors (
    videoid uuid PRIMARY KEY,
    embedding vector<float, 4096>,
    -- Minimal metadata for result assembly
    title text,
    added_date timestamp
);
```
Benefits:
- Primary `videos` table rows stay small
- Vector index is built on a smaller, focused table
- Vector-specific operations don't compete with general video queries
Keeping Embeddings Fresh
When a video's description or tags change, its embedding should be regenerated:
```python
async def update_video_embedding(video_id: str, new_description: str):
    # Regenerate embedding for updated content
    new_embedding = await get_embedding(new_description)
    await video_vectors_table.update_one(
        filter={"videoid": video_id},
        update={"$set": {"embedding": new_embedding}}
    )
Stale embeddings cause recommendation drift — the system thinks a video is about one topic when it's been updated to cover another.
Further Learning
- Astra DB Vector Search Documentation
- NVIDIA NV-Embed-QA Model
- HNSW Algorithm Paper — the graph structure underlying ANN search
- Astra Vectorize Feature
- Cosine Similarity Explained