Hybrid Search in Apache Solr - Learning Notes
What is This About?
This post explores hybrid search and reranking in Apache Solr. If you’re new to these concepts:
- Keyword search (also called lexical search) finds documents by matching exact words or phrases
- Vector search (also called semantic search) finds documents by understanding meaning and similarity
- Hybrid search combines both approaches to get the best of both worlds
- Reranking is a technique where you first retrieve candidates using one method, then reorder them using another method
Why Reranking Matters
Imagine you’re searching for “how to fix memory leaks in Kubernetes”.
- Keyword search alone might miss relevant docs that use different terminology (e.g., “memory management” instead of “memory leaks”)
- Vector search alone might return semantically similar but irrelevant docs (e.g., general memory management articles)
- Reranking lets you use keyword search to find relevant candidates, then use vector search to surface the most semantically relevant ones
Context & Goal
- Background: Search practitioner, intermediate Python coder, familiar with lexical search in Solr
- Goal: Understand hybrid search and re-ranking features in Solr
- Application: Lightspeed core implementation for OpenShift documentation
Understanding the Reranking Approach
This implementation uses a keyword-first hybrid search strategy. Let’s break down what that means and how it works.
The Two-Stage Process
Stage 1: Keyword Retrieval (Casting a Wide Net)
- Use traditional keyword search to find candidate documents
- Retrieve `k*2` documents (twice as many as you need)
- This acts as a filter: only documents matching your keywords are considered
Stage 2: Semantic Reranking (Refinement)
- Take those `k*2` candidates from Stage 1
- Use vector/semantic similarity to reorder them
- Return the top `k` documents based on the combined score (see the sketch below)
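Before looking at the full flow, here is a purely illustrative sketch of the two stages; `keyword_search` and `semantic_rerank` are hypothetical helpers, not functions from the Lightspeed code:

```python
def hybrid_search(query: str, query_vector, k: int):
    # Stage 1: keyword retrieval casts a wide net (hypothetical helper)
    candidates = keyword_search(query, rows=k * 2)

    # Stage 2: reorder those candidates by semantic similarity (hypothetical helper)
    reranked = semantic_rerank(candidates, query_vector)

    # Return only the top k after reranking
    return reranked[:k]
```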
High-Level Flow
```
User Query: "how to deploy nodejs on openshift"
        ↓
Stage 1: Keyword Search
  → Find top k*2 documents matching "deploy", "nodejs", "openshift"
  → Example: Gets 20 documents (if k=10)
        ↓
Stage 2: Semantic Reranking
  → Calculate semantic similarity for those 20 documents
  → Reorder by combining keyword score + semantic score
        ↓
Final Results: Top k documents (10 in this case)
```
Why Retrieve k*2 First?
Retrieving k*2 candidates gives the reranker a larger pool to work with. This is important because:
- The keyword search might rank documents highly that aren’t semantically the best match
- The reranker can “rescue” semantically relevant documents that ranked lower in keyword search
- It’s a balance: too few candidates = missed opportunities, too many = slower performance
Librarian Analogy
Imagine you’re asking two librarians to help you find books:
- Librarian #1 (Keyword Search):
- You ask: “Find books about deploying applications”
- They search the catalog by keywords and bring you 20 books
- They put them on a table, roughly sorted by how many times “deploy” and “application” appear
- Librarian #2 (Vector Reranker):
- Takes those same 20 books from the table
- Reads through them to understand the actual content and meaning
- Reorders them based on how well they match what you’re really looking for
- Gives you the top 10 most relevant books
The key insight: Librarian #2 can only work with what Librarian #1 found. If a book doesn’t match the keywords, it never makes it to the table.
Reference Implementation
- Lightspeed implementation: solr_vector_io/solr.py
Code Implementation Details
Now let’s look at how this is actually implemented in code.
The Function Signature
```python
async def query_hybrid(
    embedding: NDArray,      # Query vector (converted from text to numbers)
    query_string: str,       # Original query text for keyword search
    k: int,                  # Final number of results wanted
    score_threshold: float,  # Minimum score to include a result
    reranker_type: str,      # Type of reranking strategy
    reranker_params: dict,   # Contains boost values (reRankWeight, etc.)
)
```
Key inputs:
- `embedding`: The query converted to a vector (array of numbers) that represents its meaning
- `query_string`: The original text query for keyword matching
- `k`: How many final results you want (e.g., 10)
- `reranker_params`: Configuration like `reRankWeight` that controls how much semantic similarity matters
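To make the call pattern concrete, here is a hedged sketch of how a caller might invoke `query_hybrid`; the `embed_query` helper, the `reranker_type` value, and the specific parameter values are assumptions for illustration, not part of the Lightspeed code:

```python
import asyncio

async def main():
    query = "deploy nodejs on openshift"

    # Hypothetical helper: turn the query text into a vector with whatever
    # embedding model the deployment uses (not part of the reference code).
    embedding = embed_query(query)

    results = await query_hybrid(
        embedding=embedding,
        query_string=query,
        k=10,                      # final number of results
        score_threshold=0.0,       # keep everything the reranker returns
        reranker_type="solr",      # assumed label for the Solr rerank strategy
        reranker_params={"reRankWeight": 6.0},  # medium weight = balanced strategy
    )
    for doc in results:
        print(doc)

asyncio.run(main())
```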
Solr Query Parameters Explained
Here’s what gets sent to Solr:
```python
data_params = {
    # Stage 1: Initial keyword retrieval
    "q": query_string,     # Your keyword query (e.g., "deploy nodejs")
    "defType": "edismax",  # Extended DisMax parser (flexible keyword matching)
    "rows": k,             # Final result count (but we'll rerank k*2 first)

    # Stage 2: Reranking configuration
    "rq": f"{{!rerank reRankQuery=$rqq reRankDocs={k*2} reRankWeight={vector_boost}}}",
    # rq = rerank query instruction
    # reRankQuery=$rqq = use the query defined in the rqq parameter
    # reRankDocs={k*2} = rerank the top k*2 documents from keyword search
    # reRankWeight={vector_boost} = how much to weight semantic score vs keyword score

    "rqq": f"{{!knn f={vector_field} topK={k*2}}}{vector_str}",
    # rqq = the actual rerank query (KNN = K-Nearest Neighbors, a vector similarity search)
    # f={vector_field} = which field contains the document vectors
    # topK={k*2} = consider the top k*2 candidates
    # {vector_str} = the query vector as a string

    # Other parameters
    "fl": "*, score",               # Return all fields + relevance score
    "fq": ["product:*openshift*"],  # Filter query (only OpenShift docs)
    "wt": "json",                   # Response format (JSON)
}
```
Understanding the Key Parameters
| Parameter | What It Does | Example Value | Why It Matters |
|---|---|---|---|
| `q` | The keyword search query | `"deploy nodejs openshift"` | Finds initial candidates based on word matches |
| `rq` | Rerank instruction | `"{!rerank ...}"` | Tells Solr to rerank results |
| `reRankDocs` | How many docs to rerank | 20 (if k=10) | Larger pool = better reranking, but slower |
| `reRankQuery` | What to use for reranking | `$rqq` (references the `rqq` param) | Points to the vector similarity query |
| `reRankWeight` | Semantic score importance | 5.0 (medium) | Controls balance: low = keyword wins, high = semantic wins |
| `rqq` | The vector similarity query | `"{!knn f=vector topK=20}..."` | Performs semantic search on candidates |
How reRankWeight Works
The reRankWeight parameter is crucial. It controls how the final score is calculated:
final_score = keyword_score + (reRankWeight × semantic_score)
Examples:
- `reRankWeight = 1`: Semantic score has equal weight to keyword score
- `reRankWeight = 5`: Semantic score is 5× more important (balanced approach)
- `reRankWeight = 20`: Semantic score dominates (for conceptual queries)
Why this matters: Different types of queries need different balances. A query like “CVE-2024-1234” needs exact keyword matching (low weight), while “how to improve security” benefits from semantic understanding (high weight).
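A quick worked example, using made-up scores for two candidate documents, shows how the weight shifts the ordering:

```python
def final_score(keyword_score: float, semantic_score: float, rerank_weight: float) -> float:
    # Combined score as described above: keyword score plus weighted semantic score
    return keyword_score + rerank_weight * semantic_score

# Made-up scores for two candidate documents
exact_cve   = {"keyword": 10.0, "semantic": 0.60}  # the CVE the user asked about
similar_cve = {"keyword": 8.0,  "semantic": 0.75}  # a related but different CVE

for weight in (1.0, 5.0, 20.0):
    a = final_score(exact_cve["keyword"], exact_cve["semantic"], weight)
    b = final_score(similar_cve["keyword"], similar_cve["semantic"], weight)
    winner = "exact CVE" if a >= b else "similar CVE"
    print(f"reRankWeight={weight:>4}: exact={a:.1f}, similar={b:.1f} -> {winner} ranks first")
```

With weights of 1 and 5 the exact CVE stays on top, but at 20 the semantically similar (and wrong) CVE overtakes it, which is exactly the risk described in the next section.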
Choosing the Right Reranking Strategy
One of the key insights from this implementation is that different query types need different reranking strategies. You can’t use the same reRankWeight for everything.
Why One Size Doesn’t Fit All
Consider these three queries:
"CVE-2024-1234"- You want the exact security advisory"how to improve application performance"- You want conceptually relevant guides"how to patch CVE-2024-1234"- You need both the exact CVE and conceptual guidance
Each needs a different balance between keyword matching and semantic understanding.
| Strategy | reRankWeight | Best For | Example Queries |
|---|---|---|---|
| Exact Match | 1-2 | Specific IDs, codes, commands | "CVE-2024-1234", "error 404" |
| Balanced | 5-8 | Tech + action combinations | "deploy nodejs on k8s" |
| Semantic Heavy | 15-20 | Concepts, how-to, best practices | "how to improve performance" |
Strategy 1: Exact Technical Queries (Low Semantic Weight)
When to use: Queries that require precise keyword matching
Examples:
"CVE-2024-1234"- Specific security advisory ID"error code 404"- Exact error code"kubectl get pods"- Specific command syntax"API endpoint /v1/users"- Exact API path
Strategy: Low reRankWeight (1-2)
Why:
- These queries have very specific, unambiguous intent
- Exact keyword matches are more important than semantic similarity
- You don’t want semantic search to “helpfully” return similar but different CVEs or error codes
- The keyword search already finds the right documents; reranking should only make minor adjustments
Example scenario:
Query: "CVE-2024-1234"
Keyword search finds: Document about CVE-2024-1234 (score: 10.0)
Document about CVE-2024-1235 (score: 8.0) # Similar but wrong!
With low reRankWeight (1.0):
- CVE-2024-1234 stays on top (keyword score dominates)
- CVE-2024-1235 stays lower (even if semantically similar)
With high reRankWeight (20.0):
- Risk: CVE-2024-1235 might jump ahead if it's semantically similar
- Problem: User gets wrong CVE!
Strategy 2: Conceptual Queries (High Semantic Weight)
When to use: Queries about concepts, best practices, or “how-to” questions
Examples:
"how to improve performance"- Broad conceptual question"best practices for security"- General guidance"troubleshooting slow deployments"- Problem-solving query"scaling applications"- Conceptual topic
Strategy: High reRankWeight (15-20)
Why:
- These queries are about concepts, not exact terms
- Users might use different words than the documentation
- Semantic understanding helps find relevant content even if terminology differs
- Keyword search might miss relevant docs that use synonyms or related terms
Example scenario:
Query: "how to improve performance"
Keyword search finds: Doc mentioning "improve performance" (score: 9.0)
Doc about "optimization techniques" (score: 6.0) # Relevant but different words!
With low reRankWeight (1.0):
- "improve performance" doc stays on top
- "optimization techniques" stays lower (missed opportunity)
With high reRankWeight (20.0):
- "optimization techniques" jumps ahead (semantically very relevant)
- User gets better results!
Strategy 3: Mixed Queries (Balanced Weight)
When to use: Queries that combine specific terms with conceptual needs
Examples:
"how to patch CVE-2024-1234"- Specific CVE + general patching guidance"deploy nodejs on kubernetes"- Specific technologies + deployment concept"troubleshoot openshift authentication errors"- Specific product + general troubleshooting"configure SSL for nginx"- Specific tech + configuration concept
Strategy: Medium reRankWeight (5-8)
Why:
- Need to match specific keywords (technology names, product names, error codes)
- But also benefit from semantic understanding of the action/concept
- Balance ensures specific terms are matched while still finding conceptually relevant content
Example scenario:
Query: "deploy nodejs on kubernetes"
Keyword search finds: "Deploying Node.js on Kubernetes" (score: 10.0)
"Running Node.js apps in K8s" (score: 7.0) # Different words, same concept
With medium reRankWeight (6.0):
- Both documents are considered
- Exact match stays high, but semantic match can surface if very relevant
- Good balance between precision and recall
Decision Framework
Putting It All Together: A Practical Implementation Plan
Now that we understand the concepts, let’s see how to implement this in practice.
The Three-Tier Classification System
Instead of trying to pick the perfect reRankWeight for every query, we can classify queries into three tiers:
| Tier | Query Characteristics | reRankWeight | When to Use |
|---|---|---|---|
| Exact Match Critical | Security IDs (CVE, Errata), error codes, exact commands | 1-2 | Queries that must match exact keywords |
| Balanced | Technology + action combinations, mixed queries | 5-8 | Default for most queries (covers majority of cases) |
| Semantic Heavy | Questions, how-to guides, best practices, troubleshooting | 15-20 | Conceptual queries where meaning matters most |
How Classification Works
Example Classification Logic:
```python
import re

def classify_query(query: str) -> str:
    # Exact match critical: CVE, Errata, specific error codes
    if re.search(r'CVE-\d{4}-\d+', query) or 'errata' in query.lower():
        return "exact_match"

    # Semantic heavy: questions, how-to, best practices
    if query.lower().startswith(('how', 'what', 'why', 'when')) or \
            'best practice' in query.lower() or 'troubleshoot' in query.lower():
        return "semantic_heavy"

    # Default: balanced
    return "balanced"

# Map each tier to a reRankWeight
weight_map = {
    "exact_match": 1.5,
    "balanced": 6.0,
    "semantic_heavy": 18.0,
}
```
Implementation Steps
1. Extend intent detection to classify queries into three tiers
   - Use pattern matching (regex, keywords)
   - Leverage existing intent detection if available
   - Start simple, refine based on data
2. Map each tier to a reRankWeight value
   - Start with the suggested ranges (1-2, 5-8, 15-20)
   - Fine-tune based on your specific use case
3. Test on historical query logs
   - Run queries through both old and new systems
   - Compare result quality (relevance, user satisfaction)
   - Measure performance impact
4. Monitor and iterate
   - Track which queries get which classification
   - Collect user feedback on result quality
   - Adjust weights and classification rules based on data
Advantages of This Approach
- Practical starting point: Three tiers cover most use cases without being too complex
- Data-driven refinement: Start with defaults, improve based on real queries
- Explainable: Easy to understand why a query got a certain weight
- Extensible: Can add more tiers or dynamic weights later
Alternative Approaches (For Future Learning)
This implementation uses keyword-first reranking, but there are other hybrid search strategies:
- Union-based: Run keyword and vector search separately, merge results
- RRF (Reciprocal Rank Fusion): Combine rankings from multiple search methods (a small sketch follows this list)
- Learning to Rank (LTR): Use machine learning to automatically optimize weights
- Dynamic weights: Adjust `reRankWeight` based on query features (length, term frequency, etc.)
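As a taste of the union-based direction, here is a minimal sketch of Reciprocal Rank Fusion over two independently produced rankings; the constant 60 is the value commonly used in the RRF literature, and the document IDs are made up:

```python
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document,
    # and documents are sorted by the summed score.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc_a", "doc_b", "doc_c", "doc_d"]   # made-up keyword ranking
vector_results  = ["doc_c", "doc_a", "doc_e", "doc_b"]   # made-up vector ranking
print(rrf_merge([keyword_results, vector_results]))
# Documents ranked highly by both lists (doc_a, doc_c) rise to the top
```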
Key Takeaways
Reranking is a two-stage process: Keyword search finds candidates, semantic search refines the ranking
reRankWeight controls the balance: It determines how much semantic similarity matters vs. keyword matching
- Low (1-2): Keyword matching dominates
- Medium (5-8): Balanced approach
- High (15-20): Semantic similarity dominates
Different query types need different strategies:
- Exact technical queries → Low weight
- Conceptual queries → High weight
- Mixed queries → Medium weight
Start simple, iterate based on data:
- Three tiers is a practical starting point
- Refine weights and classification rules based on real query performance
This is keyword-first hybrid:
- Only documents matching keywords are considered
- Reranking refines within that set
- This is different from union-based approaches that merge separate results
Why retrieve `k*2` candidates?
- Gives the reranker a larger pool to work with
- Allows semantically relevant docs to “rescue” from lower keyword ranks
- Balance between quality and performance
Next Learning Topics
- Experiment design: How to systematically test reranking strategies with query logs
- Alternative hybrid approaches: Union-based search, RRF (Reciprocal Rank Fusion)
- Dynamic reRankWeight: Adjusting weights based on query features automatically
- Learning to Rank (LTR): Using machine learning to optimize reranking weights
- Performance optimization: Balancing reranking quality with query latency
Reference Materials
- Sease.io blog: Hybrid Search with Apache Solr - Comprehensive guide to hybrid search concepts
- Lightspeed implementation: solr_vector_io/solr.py - Real-world code example
- Solveit Dialog: Hybrid Search in Solr - Interactive learning resource