Hybrid Search in Apache Solr - Learning Notes
What is This About?
This post explores hybrid search and reranking in Apache Solr. If you’re new to these concepts:
- Keyword search (also called lexical search) finds documents by matching exact words or phrases
- Vector search (also called semantic search) finds documents by understanding meaning and similarity
- Hybrid search combines both approaches to get the best of both worlds
- Reranking is a technique where you first retrieve candidates using one method, then reorder them using another method
Why Reranking Matters
Imagine you’re searching for “how to fix memory leaks in Kubernetes”.
- Keyword search alone might miss relevant docs that use different terminology (e.g., “memory management” instead of “memory leaks”)
- Vector search alone might return semantically similar but irrelevant docs (e.g., general memory management articles)
- Reranking lets you use keyword search to find relevant candidates, then use vector search to surface the most semantically relevant ones
Context & Goal
- Background: Search practitioner, intermediate Python coder, familiar with lexical search in Solr
- Goal: Understand hybrid search and re-ranking features in Solr
- Application: Lightspeed core implementation for OpenShift documentation
Understanding the Reranking Approach
This implementation uses a keyword-first hybrid search strategy. Let’s break down what that means and how it works.
The Two-Stage Process
Stage 1: Keyword Retrieval (Casting a Wide Net)
- Use traditional keyword search to find candidate documents
- Retrieve `k*2` documents (twice as many as you need)
- This acts as a filter: only documents matching your keywords are considered
Stage 2: Semantic Reranking (Refinement)
- Take those `k*2` candidates from Stage 1
- Use vector/semantic similarity to reorder them
- Return the top `k` documents based on the combined score (see the sketch below)
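Before looking at the full flow, here is a purely illustrative sketch of the two stages; `keyword_search` and `semantic_rerank` are hypothetical helpers, not functions from the Lightspeed code:

```python
def hybrid_search(query: str, query_vector, k: int):
    # Stage 1: keyword retrieval casts a wide net (hypothetical helper)
    candidates = keyword_search(query, rows=k * 2)

    # Stage 2: reorder those candidates by semantic similarity (hypothetical helper)
    reranked = semantic_rerank(candidates, query_vector)

    # Return only the top k after reranking
    return reranked[:k]
```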
High-Level Flow
```
User Query: "how to deploy nodejs on openshift"
        ↓
Stage 1: Keyword Search
  → Find top k*2 documents matching "deploy", "nodejs", "openshift"
  → Example: Gets 20 documents (if k=10)
        ↓
Stage 2: Semantic Reranking
  → Calculate semantic similarity for those 20 documents
  → Reorder by combining keyword score + semantic score
        ↓
Final Results: Top k documents (10 in this case)
```
Why Retrieve k*2 First?
Retrieving k*2 candidates gives the reranker a larger pool to work with. This is important because:
- The keyword search might rank documents highly that aren’t semantically the best match
- The reranker can “rescue” semantically relevant documents that ranked lower in keyword search
- It’s a balance: too few candidates = missed opportunities, too many = slower performance
Librarian Analogy
Imagine you’re asking two librarians to help you find books:
- Librarian #1 (Keyword Search):
- You ask: “Find books about deploying applications”
- They search the catalog by keywords and bring you 20 books
- They put them on a table, roughly sorted by how many times “deploy” and “application” appear
- Librarian #2 (Vector Reranker):
- Takes those same 20 books from the table
- Reads through them to understand the actual content and meaning
- Reorders them based on how well they match what you’re really looking for
- Gives you the top 10 most relevant books
The key insight: Librarian #2 can only work with what Librarian #1 found. If a book doesn’t match the keywords, it never makes it to the table.
Reference Implementation
- Lightspeed implementation: solr_vector_io/solr.py
Code Implementation Details
Now let’s look at how this is actually implemented in code.
The Function Signature
```python
async def query_hybrid(
    embedding: NDArray,      # Query vector (converted from text to numbers)
    query_string: str,       # Original query text for keyword search
    k: int,                  # Final number of results wanted
    score_threshold: float,  # Minimum score to include a result
    reranker_type: str,      # Type of reranking strategy
    reranker_params: dict,   # Contains boost values (reRankWeight, etc.)
)
```
Key inputs:
- `embedding`: The query converted to a vector (array of numbers) that represents its meaning
- `query_string`: The original text query for keyword matching
- `k`: How many final results you want (e.g., 10)
- `reranker_params`: Configuration like `reRankWeight` that controls how much semantic similarity matters
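To make the call pattern concrete, here is a hedged sketch of how a caller might invoke `query_hybrid`; the `embed_query` helper, the `reranker_type` value, and the specific parameter values are assumptions for illustration, not part of the Lightspeed code:

```python
import asyncio

async def main():
    query = "deploy nodejs on openshift"

    # Hypothetical helper: turn the query text into a vector with whatever
    # embedding model the deployment uses (not part of the reference code).
    embedding = embed_query(query)

    results = await query_hybrid(
        embedding=embedding,
        query_string=query,
        k=10,                      # final number of results
        score_threshold=0.0,       # keep everything the reranker returns
        reranker_type="solr",      # assumed label for the Solr rerank strategy
        reranker_params={"reRankWeight": 6.0},  # medium weight = balanced strategy
    )
    for doc in results:
        print(doc)

asyncio.run(main())
```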
Solr Query Parameters Explained
Here’s what gets sent to Solr:
```python
data_params = {
    # Stage 1: Initial keyword retrieval
    "q": query_string,     # Your keyword query (e.g., "deploy nodejs")
    "defType": "edismax",  # Extended DisMax parser (flexible keyword matching)
    "rows": k,             # Final result count (but we'll rerank k*2 first)

    # Stage 2: Reranking configuration
    "rq": f"{{!rerank reRankQuery=$rqq reRankDocs={k*2} reRankWeight={vector_boost}}}",
    # rq = rerank query instruction
    # reRankQuery=$rqq = use the query defined in the rqq parameter
    # reRankDocs={k*2} = rerank the top k*2 documents from keyword search
    # reRankWeight={vector_boost} = how much to weight semantic score vs keyword score

    "rqq": f"{{!knn f={vector_field} topK={k*2}}}{vector_str}",
    # rqq = the actual rerank query (KNN = K-Nearest Neighbors, a vector similarity search)
    # f={vector_field} = which field contains the document vectors
    # topK={k*2} = consider the top k*2 candidates
    # {vector_str} = the query vector as a string

    # Other parameters
    "fl": "*, score",               # Return all fields + relevance score
    "fq": ["product:*openshift*"],  # Filter query (only OpenShift docs)
    "wt": "json",                   # Response format (JSON)
}
```
Understanding the Key Parameters
| Parameter | What It Does | Example Value | Why It Matters |
|---|---|---|---|
| `q` | The keyword search query | `"deploy nodejs openshift"` | Finds initial candidates based on word matches |
| `rq` | Rerank instruction | `"{!rerank ...}"` | Tells Solr to rerank results |
| `reRankDocs` | How many docs to rerank | 20 (if k=10) | Larger pool = better reranking, but slower |
| `reRankQuery` | What to use for reranking | `$rqq` (references the `rqq` param) | Points to the vector similarity query |
| `reRankWeight` | Semantic score importance | 5.0 (medium) | Controls balance: low = keyword wins, high = semantic wins |
| `rqq` | The vector similarity query | `"{!knn f=vector topK=20}..."` | Performs semantic search on candidates |
How reRankWeight Works
The reRankWeight parameter is crucial. It controls how the final score is calculated:
final_score = keyword_score + (reRankWeight × semantic_score)
Examples:
- `reRankWeight = 1`: Semantic score has equal weight to keyword score
- `reRankWeight = 5`: Semantic score is 5× more important (balanced approach)
- `reRankWeight = 20`: Semantic score dominates (for conceptual queries)
Why this matters: Different types of queries need different balances. A query like “CVE-2024-1234” needs exact keyword matching (low weight), while “how to improve security” benefits from semantic understanding (high weight).
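A quick worked example, using made-up scores for two candidate documents, shows how the weight shifts the ordering:

```python
def final_score(keyword_score: float, semantic_score: float, rerank_weight: float) -> float:
    # Combined score as described above: keyword score plus weighted semantic score
    return keyword_score + rerank_weight * semantic_score

# Made-up scores for two candidate documents
exact_cve   = {"keyword": 10.0, "semantic": 0.60}  # the CVE the user asked about
similar_cve = {"keyword": 8.0,  "semantic": 0.75}  # a related but different CVE

for weight in (1.0, 5.0, 20.0):
    a = final_score(exact_cve["keyword"], exact_cve["semantic"], weight)
    b = final_score(similar_cve["keyword"], similar_cve["semantic"], weight)
    winner = "exact CVE" if a >= b else "similar CVE"
    print(f"reRankWeight={weight:>4}: exact={a:.1f}, similar={b:.1f} -> {winner} ranks first")
```

With weights of 1 and 5 the exact CVE stays on top, but at 20 the semantically similar (and wrong) CVE overtakes it, which is exactly the risk described in the next section.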
Choosing the Right Reranking Strategy
One of the key insights from this implementation is that different query types need different reranking strategies. You can’t use the same reRankWeight for everything.
Why One Size Doesn’t Fit All
Consider these three queries:
"CVE-2024-1234"- You want the exact security advisory"how to improve application performance"- You want conceptually relevant guides"how to patch CVE-2024-1234"- You need both the exact CVE and conceptual guidance
Each needs a different balance between keyword matching and semantic understanding.
| Strategy | reRankWeight | Best For | Example Queries |
|---|---|---|---|
| Exact Match | 1-2 | Specific IDs, codes, commands | "CVE-2024-1234", "error 404" |
| Balanced | 5-8 | Tech + action combinations | "deploy nodejs on k8s" |
| Semantic Heavy | 15-20 | Concepts, how-to, best practices | "how to improve performance" |
Strategy 1: Exact Technical Queries (Low Semantic Weight)
When to use: Queries that require precise keyword matching
Examples:
"CVE-2024-1234"- Specific security advisory ID"error code 404"- Exact error code"kubectl get pods"- Specific command syntax"API endpoint /v1/users"- Exact API path
Strategy: Low reRankWeight (1-2)
Why:
- These queries have very specific, unambiguous intent
- Exact keyword matches are more important than semantic similarity
- You don’t want semantic search to “helpfully” return similar but different CVEs or error codes
- The keyword search already finds the right documents; reranking should only make minor adjustments
Example scenario:
Query: "CVE-2024-1234"
Keyword search finds: Document about CVE-2024-1234 (score: 10.0)
Document about CVE-2024-1235 (score: 8.0) # Similar but wrong!
With low reRankWeight (1.0):
- CVE-2024-1234 stays on top (keyword score dominates)
- CVE-2024-1235 stays lower (even if semantically similar)
With high reRankWeight (20.0):
- Risk: CVE-2024-1235 might jump ahead if it's semantically similar
- Problem: User gets wrong CVE!
Strategy 2: Conceptual Queries (High Semantic Weight)
When to use: Queries about concepts, best practices, or “how-to” questions
Examples:
"how to improve performance"- Broad conceptual question"best practices for security"- General guidance"troubleshooting slow deployments"- Problem-solving query"scaling applications"- Conceptual topic
Strategy: High reRankWeight (15-20)
Why:
- These queries are about concepts, not exact terms
- Users might use different words than the documentation
- Semantic understanding helps find relevant content even if terminology differs
- Keyword search might miss relevant docs that use synonyms or related terms
Example scenario:
Query: "how to improve performance"
Keyword search finds: Doc mentioning "improve performance" (score: 9.0)
Doc about "optimization techniques" (score: 6.0) # Relevant but different words!
With low reRankWeight (1.0):
- "improve performance" doc stays on top
- "optimization techniques" stays lower (missed opportunity)
With high reRankWeight (20.0):
- "optimization techniques" jumps ahead (semantically very relevant)
- User gets better results!
Strategy 3: Mixed Queries (Balanced Weight)
When to use: Queries that combine specific terms with conceptual needs
Examples:
"how to patch CVE-2024-1234"- Specific CVE + general patching guidance"deploy nodejs on kubernetes"- Specific technologies + deployment concept"troubleshoot openshift authentication errors"- Specific product + general troubleshooting"configure SSL for nginx"- Specific tech + configuration concept
Strategy: Medium reRankWeight (5-8)
Why:
- Need to match specific keywords (technology names, product names, error codes)
- But also benefit from semantic understanding of the action/concept
- Balance ensures specific terms are matched while still finding conceptually relevant content
Example scenario:
Query: "deploy nodejs on kubernetes"
Keyword search finds: "Deploying Node.js on Kubernetes" (score: 10.0)
"Running Node.js apps in K8s" (score: 7.0) # Different words, same concept
With medium reRankWeight (6.0):
- Both documents are considered
- Exact match stays high, but semantic match can surface if very relevant
- Good balance between precision and recall
Decision Framework
Putting It All Together: A Practical Implementation Plan
Now that we understand the concepts, let’s see how to implement this in practice.
The Three-Tier Classification System
Instead of trying to pick the perfect reRankWeight for every query, we can classify queries into three tiers:
| Tier | Query Characteristics | reRankWeight | When to Use |
|---|---|---|---|
| Exact Match Critical | Security IDs (CVE, Errata), error codes, exact commands | 1-2 | Queries that must match exact keywords |
| Balanced | Technology + action combinations, mixed queries | 5-8 | Default for most queries (covers majority of cases) |
| Semantic Heavy | Questions, how-to guides, best practices, troubleshooting | 15-20 | Conceptual queries where meaning matters most |
How Classification Works
Example Classification Logic:
```python
import re

def classify_query(query: str) -> str:
    # Exact match critical: CVE, Errata, specific error codes
    if re.search(r'CVE-\d{4}-\d+', query) or 'errata' in query.lower():
        return "exact_match"

    # Semantic heavy: questions, how-to, best practices
    if query.lower().startswith(('how', 'what', 'why', 'when')) or \
            'best practice' in query.lower() or 'troubleshoot' in query.lower():
        return "semantic_heavy"

    # Default: balanced
    return "balanced"

# Map each tier to a reRankWeight
weight_map = {
    "exact_match": 1.5,
    "balanced": 6.0,
    "semantic_heavy": 18.0,
}
```
Implementation Steps
1. Extend intent detection to classify queries into three tiers
   - Use pattern matching (regex, keywords)
   - Leverage existing intent detection if available
   - Start simple, refine based on data
2. Map each tier to a reRankWeight value
   - Start with the suggested ranges (1-2, 5-8, 15-20)
   - Fine-tune based on your specific use case
3. Test on historical query logs
   - Run queries through both old and new systems
   - Compare result quality (relevance, user satisfaction)
   - Measure performance impact
4. Monitor and iterate
   - Track which queries get which classification
   - Collect user feedback on result quality
   - Adjust weights and classification rules based on data
Advantages of This Approach
- Practical starting point: Three tiers cover most use cases without being too complex
- Data-driven refinement: Start with defaults, improve based on real queries
- Explainable: Easy to understand why a query got a certain weight
- Extensible: Can add more tiers or dynamic weights later
Alternative Approaches (For Future Learning)
This implementation uses keyword-first reranking, but there are other hybrid search strategies:
- Union-based: Run keyword and vector search separately, merge results
- RRF (Reciprocal Rank Fusion): Combine rankings from multiple search methods (a small sketch follows this list)
- Learning to Rank (LTR): Use machine learning to automatically optimize weights
- Dynamic weights: Adjust `reRankWeight` based on query features (length, term frequency, etc.)
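As a taste of the union-based direction, here is a minimal sketch of Reciprocal Rank Fusion over two independently produced rankings; the constant 60 is the value commonly used in the RRF literature, and the document IDs are made up:

```python
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document,
    # and documents are sorted by the summed score.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc_a", "doc_b", "doc_c", "doc_d"]   # made-up keyword ranking
vector_results  = ["doc_c", "doc_a", "doc_e", "doc_b"]   # made-up vector ranking
print(rrf_merge([keyword_results, vector_results]))
# Documents ranked highly by both lists (doc_a, doc_c) rise to the top
```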
Key Takeaways
Reranking is a two-stage process: Keyword search finds candidates, semantic search refines the ranking
reRankWeight controls the balance: It determines how much semantic similarity matters vs. keyword matching
- Low (1-2): Keyword matching dominates
- Medium (5-8): Balanced approach
- High (15-20): Semantic similarity dominates
Different query types need different strategies:
- Exact technical queries → Low weight
- Conceptual queries → High weight
- Mixed queries → Medium weight
Start simple, iterate based on data:
- Three tiers is a practical starting point
- Refine weights and classification rules based on real query performance
This is keyword-first hybrid:
- Only documents matching keywords are considered
- Reranking refines within that set
- This is different from union-based approaches that merge separate results
Why retrieve `k*2` candidates?
- Gives the reranker a larger pool to work with
- Allows semantically relevant docs to “rescue” from lower keyword ranks
- Balance between quality and performance
Next Learning Topics
- Experiment design: How to systematically test reranking strategies with query logs
- Alternative hybrid approaches: Union-based search, RRF (Reciprocal Rank Fusion)
- Dynamic reRankWeight: Adjusting weights based on query features automatically
- Learning to Rank (LTR): Using machine learning to optimize reranking weights
- Performance optimization: Balancing reranking quality with query latency
Reference Materials
- Sease.io blog: Hybrid Search with Apache Solr - Comprehensive guide to hybrid search concepts
- Lightspeed implementation: solr_vector_io/solr.py - Real-world code example
- Solveit Dialog: Hybrid Search in Solr - Interactive learning resource