BM25 Full-Text Search Roadmap for Astra DB
Overview
This document outlines how to add BM25 (Best Match 25) full-text search to the KillrVideo application using Astra DB and Cassandra's Storage-Attached Indexes (SAI).
What is BM25? A ranking algorithm for full-text search that scores documents based on term frequency, document length, and corpus statistics. It's the algorithm behind most modern search engines (Elasticsearch, Solr, etc.).
Current State: KillrVideo uses:
- Vector search (semantic): For meaning-based queries
- Keyword search (regex): For simple text matching
Goal: Add BM25 for production-quality full-text search with ranking.
Why BM25?
Comparison of Search Methods
| Feature | Keyword (Regex) | Vector (Semantic) | BM25 (Full-Text) |
|---|---|---|---|
| Exact matches | Excellent | Poor | Excellent |
| Fuzzy matches | None | Excellent | Limited |
| Phrase search | Basic | None | Excellent |
| Ranking quality | None | Excellent | Excellent |
| Query speed | 10-50ms | 50-200ms | 5-30ms |
| Storage overhead | Minimal | 16KB/doc | Moderate |
| Setup complexity | Easy | Complex | Moderate |
When to use BM25:
- Product catalogs (e-commerce)
- Document search (knowledge bases)
- Multi-field queries ("search title AND description")
- Precise ranking based on term importance
When to use Vector:
- Natural language queries ("videos about...")
- Conceptual similarity
- Cross-language search
- Recommendation systems
Best practice: Use both together!
Implementation Roadmap
Phase 1: Enable SAI Text Indexes (Basic)
Timeline: 1-2 weeks
Goal: Replace regex search with SAI text indexes
Step 1.1: Create SAI Indexes
-- Index video names for text search
CREATE CUSTOM INDEX IF NOT EXISTS videos_name_text_idx
ON killrvideo.videos(name)
USING 'StorageAttachedIndex';
-- Index descriptions
CREATE CUSTOM INDEX IF NOT EXISTS videos_description_text_idx
ON killrvideo.videos(description)
USING 'StorageAttachedIndex';
-- Index tags (already exists for filtering, reuse for text)
CREATE CUSTOM INDEX IF NOT EXISTS videos_tags_idx
ON killrvideo.videos(tags)
USING 'StorageAttachedIndex';
Step 1.2: Update Search Service
async def search_videos_by_text(
query: str,
page: int,
page_size: int
) -> Tuple[List[VideoSummary], int]:
"""Full-text search across name, description, tags using SAI."""
videos_table = await get_table("videos")
# Multi-field search using $or
cursor = videos_table.find(
filter={
"$or": [
{"name": {"$regex": query, "$options": "i"}},
{"description": {"$regex": query, "$options": "i"}},
{"tags": query} # Exact tag match
]
},
limit=page_size,
skip=(page - 1) * page_size
)
docs = await cursor.to_list()
summaries = [VideoSummary.model_validate(d) for d in docs]
return summaries, len(docs)
Limitations:
- No relevance scoring (results not ranked)
- Case-insensitive regex (not true full-text)
- No stemming ("running" won't match "run")
Phase 2: Add Relevance Scoring (Intermediate)
Timeline: 2-4 weeks
Goal: Rank results by relevance using client-side scoring
Step 2.1: Implement TF-IDF Scoring
TF-IDF (Term Frequency-Inverse Document Frequency) is a simplified version of BM25.
import math
from collections import Counter
from typing import List, Dict
def compute_tfidf_score(
query: str,
document: str,
corpus_size: int,
term_doc_counts: Dict[str, int]
) -> float:
"""Compute TF-IDF score for a document given a query."""
query_terms = query.lower().split()
doc_terms = document.lower().split()
doc_term_counts = Counter(doc_terms)
score = 0.0
for term in query_terms:
# Term Frequency (TF)
tf = doc_term_counts.get(term, 0) / max(len(doc_terms), 1)
# Inverse Document Frequency (IDF)
doc_count = term_doc_counts.get(term, 1)
idf = math.log(corpus_size / doc_count)
score += tf * idf
return score
Limitations:
- Client-side scoring (slow for large result sets)
- Multiple queries needed (corpus stats)
- Not true BM25 (missing saturation, length normalization)
Phase 3: True BM25 with Cassandra 5.x (Advanced)
Timeline: 4-8 weeks (requires Cassandra upgrade)
Goal: Server-side BM25 scoring using Cassandra 5.x SAI improvements
Note: This requires Cassandra 5.x features that may not be fully available in current Astra DB.
Step 3.1: Create BM25-Enabled Indexes
Hypothetical future CQL (syntax may vary):
CREATE CUSTOM INDEX videos_fulltext_bm25_idx
ON killrvideo.videos(name, description)
USING 'StorageAttachedIndex'
WITH OPTIONS = {
'analyzer': 'standard', -- Tokenization + stemming
'scoring': 'bm25', -- Ranking algorithm
'k1': 1.2, -- BM25 parameter (term saturation)
'b': 0.75 -- BM25 parameter (length normalization)
};
Parameters:
- k1 (1.0-2.0): Controls term frequency saturation (default 1.2)
- b (0.0-1.0): Controls document length normalization (default 0.75)
Step 3.2: Query with BM25 Scoring
Hypothetical future Data API syntax:
cursor = videos_table.find(
filter={
"$text": {
"query": "python tutorial beginners",
"fields": ["name", "description"],
"analyzer": "standard"
}
},
sort={"$score": -1}, # Sort by relevance score
limit=page_size,
include_score=True # Return BM25 score
)
Benefits:
- Server-side scoring (fast)
- True BM25 algorithm
- Phrase queries, stemming, stopwords
- Multi-field boosting (name > description)
Hybrid Search Architecture (Recommended)
The best approach combines all three search methods:
async def search_videos_hybrid(
query: str,
page: int,
page_size: int,
weights: Dict[str, float] = None
) -> Tuple[List[VideoSummary], int]:
"""Hybrid search combining BM25, vector, and keyword."""
if weights is None:
weights = {"bm25": 0.5, "vector": 0.3, "keyword": 0.2}
# Execute all three searches in parallel
bm25_results, vector_results, keyword_results = await asyncio.gather(
search_with_bm25(query, page, page_size * 2),
search_with_vector(query, page, page_size * 2),
search_with_keyword(query, page, page_size * 2)
)
# Normalize scores to 0-1 range
normalized = {
"bm25": _normalize_scores(bm25_results),
"vector": _normalize_scores(vector_results),
"keyword": _normalize_scores(keyword_results)
}
# Combine scores with weights
combined_scores = {}
for method, results in normalized.items():
for video_id, score in results.items():
if video_id not in combined_scores:
combined_scores[video_id] = 0.0
combined_scores[video_id] += score * weights[method]
# Sort by combined score
ranked_videos = sorted(
combined_scores.items(),
key=lambda x: x[1],
reverse=True
)
return videos, len(ranked_videos)
Benefits:
- BM25: Handles exact term matches, phrases
- Vector: Handles semantic queries, synonyms
- Keyword: Fallback for simple queries
Tunable weights: Adjust based on user feedback and A/B testing.
Implementation Checklist
Phase 1: SAI Text Indexes
- [ ] Create SAI indexes on
name,description,tags - [ ] Update
search_videos_by_keyword()to use multi-field $or query - [ ] Add search mode parameter:
?mode=text - [ ] Test case-insensitive search
- [ ] Deploy and monitor performance
Phase 2: Client-Side Ranking
- [ ] Implement TF-IDF scoring function
- [ ] Build term document count index
- [ ] Add scoring to search results
- [ ] Benchmark performance (should be < 100ms for 1000 docs)
- [ ] Compare ranking quality vs keyword search
- [ ] Add caching for corpus statistics
Phase 3: Server-Side BM25 (Future)
- [ ] Monitor Cassandra 5.x BM25 feature availability in Astra
- [ ] Test BM25 index creation in staging
- [ ] Migrate from client-side to server-side scoring
- [ ] Tune BM25 parameters (k1, b)
- [ ] A/B test BM25 vs TF-IDF
- [ ] Document new query syntax
Phase 4: Hybrid Search
- [ ] Implement hybrid search combining BM25 + vector + keyword
- [ ] Add weight tuning interface
- [ ] A/B test hybrid vs individual methods
- [ ] Optimize for latency (parallel queries)
- [ ] Add result diversity logic
- [ ] Monitor search quality metrics (click-through rate, etc.)
Performance Considerations
Storage
| Method | Storage per 1M Videos |
|---|---|
| Keyword (regex) | ~100 MB (index overhead) |
| TF-IDF (client-side) | ~500 MB (term statistics) |
| BM25 (server-side) | ~500 MB (inverted index) |
| Vector | ~16 GB (embeddings) |
Query Latency
| Method | Expected Latency | Scalability |
|---|---|---|
| Keyword (SAI) | 10-50ms | Linear with corpus |
| TF-IDF (client) | 50-200ms | Slow for large corpus |
| BM25 (server) | 10-40ms | Sub-linear with index |
| Vector | 50-200ms | Depends on ANN algorithm |
| Hybrid | 60-250ms | Parallel execution helps |
Cost
| Method | Cost Factor |
|---|---|
| Keyword | Low (index storage) |
| TF-IDF | Medium (compute for scoring) |
| BM25 | Medium (index storage + compute) |
| Vector | High (embeddings + compute) |
Testing Strategy
Unit Tests
# Test BM25 scoring
def test_bm25_score():
doc = "python tutorial for beginners"
query = "python tutorial"
score = compute_bm25_score(query, doc, corpus_size=1000, ...)
assert score > 0
assert score > compute_bm25_score("java tutorial", doc, ...)
# Test ranking
async def test_search_ranking():
results = await search_with_bm25("python tutorial", page=1, page_size=10)
# First result should be most relevant
assert "python" in results[0].name.lower()
# Scores should be descending
for i in range(len(results) - 1):
assert results[i].score >= results[i+1].score
Migration Path
For Existing Applications
-
Add BM25 alongside existing search (don't replace immediately)
# Support both modes if mode == "bm25": results = search_with_bm25(query) else: results = search_with_keyword(query) # Keep old behavior -
A/B test with real users (50/50 traffic split)
-
Monitor metrics:
- Search latency (p50, p95, p99)
- Result click-through rate
- Zero-result queries
- User satisfaction scores
-
Gradually increase BM25 traffic (0% → 10% → 50% → 100%)
-
Make BM25 the default after validation
Backward Compatibility
# Support old API
@router.get("/search/videos")
async def search_videos(
query: str,
mode: Literal["keyword", "semantic", "bm25"] = "bm25", # BM25 default
legacy: bool = False # Force old keyword search
):
if legacy:
return await search_videos_by_keyword(query)
# New unified search
return await search_videos_unified(query, mode)
Resources
Cassandra Documentation
BM25 Algorithm
Next Steps
-
Evaluate current needs: Is keyword search sufficient, or do you need ranking?
-
Check Astra roadmap: When will BM25 features be available?
-
Start with Phase 1: SAI text indexes are available now
-
Plan for Phase 2: Client-side TF-IDF for interim solution
-
Future-proof for Phase 3: Design APIs to support server-side BM25 later
-
Consider hybrid: Most production systems use multiple search methods
Status: Roadmap (BM25 server-side features pending in Astra DB)