Hybrid Search in Apache Solr - Learning Notes

Understanding hybrid search and reranking strategies in Apache Solr
Tags: information-retrieval, apache-solr, search

Published: November 22, 2025

What is This About?

This post explores hybrid search and reranking in Apache Solr. If you’re new to these concepts:

  • Keyword search (also called lexical search) finds documents by matching exact words or phrases
  • Vector search (also called semantic search) finds documents by understanding meaning and similarity
  • Hybrid search combines both approaches to get the best of both worlds
  • Reranking is a technique where you first retrieve candidates using one method, then reorder them using another method

Why Reranking Matters

Note: The Problem with Single-Method Search

Imagine you’re searching for “how to fix memory leaks in Kubernetes”.

  • Keyword search alone might miss relevant docs that use different terminology (e.g., “memory management” instead of “memory leaks”)
  • Vector search alone might return semantically similar but irrelevant docs (e.g., general memory management articles)
  • Reranking lets you use keyword search to find relevant candidates, then use vector search to surface the most semantically relevant ones

Context & Goal

  • Background: Search practitioner, intermediate Python coder, familiar with lexical search in Solr
  • Goal: Understand hybrid search and re-ranking features in Solr
  • Application: Lightspeed core implementation for OpenShift documentation

Understanding the Reranking Approach

This implementation uses a keyword-first hybrid search strategy. Let’s break down what that means and how it works.

The Two-Stage Process

Stage 1: Keyword Retrieval (Cast a Wide Net)

  • Use traditional keyword search to find candidate documents
  • Retrieve k*2 documents (twice as many as you need)
  • This acts as a filter: only documents matching your keywords are considered

Stage 2: Semantic Reranking (Refinement)

  • Take those k*2 candidates from Stage 1
  • Use vector/semantic similarity to reorder them
  • Return the top k documents based on the combined score

High-Level Flow

User Query: "how to deploy nodejs on openshift"
    ↓
Stage 1: Keyword Search
    → Find top k*2 documents matching "deploy", "nodejs", "openshift"
    → Example: Gets 20 documents (if k=10)
    ↓
Stage 2: Semantic Reranking  
    → Calculate semantic similarity for those 20 documents
    → Reorder by combining keyword score + semantic score
    ↓
Final Results: Top k documents (10 in this case)
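
The same flow, expressed as a minimal Python sketch. This is not the actual implementation (which delegates both stages to Solr, as shown later); keyword_search and semantic_similarity are hypothetical helpers standing in for the keyword query and the vector comparison:

def rerank_keyword_first(query, query_vector, k, rerank_weight):
    # Stage 1: cast a wide net with keyword search (k*2 candidates)
    candidates = keyword_search(query, rows=k * 2)            # hypothetical helper

    # Stage 2: reorder those candidates by blending both signals
    for doc in candidates:
        doc["semantic_score"] = semantic_similarity(query_vector, doc["vector"])  # hypothetical helper
        doc["final_score"] = doc["keyword_score"] + rerank_weight * doc["semantic_score"]

    candidates.sort(key=lambda d: d["final_score"], reverse=True)
    return candidates[:k]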

Why Retrieve k*2 First?

Tip: The k×2 Strategy

Retrieving k*2 candidates gives the reranker a larger pool to work with. This is important because:

  • The keyword search might rank documents highly that aren’t semantically the best match
  • The reranker can “rescue” semantically relevant documents that ranked lower in keyword search
  • It’s a balance: too few candidates = missed opportunities, too many = slower performance

Librarian Analogy

Imagine you’re asking two librarians to help you find books:

  • Librarian #1 (Keyword Search):
    • You ask: “Find books about deploying applications”
    • They search the catalog by keywords and bring you 20 books
    • They put them on a table, roughly sorted by how many times “deploy” and “application” appear
  • Librarian #2 (Vector Reranker):
    • Takes those same 20 books from the table
    • Reads through them to understand the actual content and meaning
    • Reorders them based on how well they match what you’re really looking for
    • Gives you the top 10 most relevant books

The key insight: Librarian #2 can only work with what Librarian #1 found. If a book doesn’t match the keywords, it never makes it to the table.

Reference Implementation

Code Implementation Details

Now let’s look at how this is actually implemented in code.

The Function Signature

from numpy.typing import NDArray   # assumed import for the NDArray annotation

async def query_hybrid(
    embedding: NDArray,            # Query vector (converted from text to numbers)
    query_string: str,             # Original query text for keyword search
    k: int,                        # Final number of results wanted
    score_threshold: float,        # Minimum score to include a result
    reranker_type: str,            # Type of reranking strategy
    reranker_params: dict,         # Contains boost values (reRankWeight, etc.)
):
    ...

Key inputs:

  • embedding: The query converted to a vector (array of numbers) that represents its meaning
  • query_string: The original text query for keyword matching
  • k: How many final results you want (e.g., 10)
  • reranker_params: Configuration like reRankWeight that controls how much semantic similarity matters
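
A hypothetical call, just to make the inputs concrete. The embedding would normally come from the same model used to embed the indexed documents; the 768-dimension random vector and the reranker_type value below are placeholders, not values from the reference implementation:

import asyncio

import numpy as np

query_text = "how to deploy nodejs on openshift"
query_vector = np.random.rand(768).astype(np.float32)   # stand-in for a real embedding

results = asyncio.run(
    query_hybrid(
        embedding=query_vector,
        query_string=query_text,
        k=10,                                   # want 10 final results
        score_threshold=0.0,                    # keep everything for this example
        reranker_type="rerank",                 # placeholder value
        reranker_params={"reRankWeight": 6.0},  # balanced weight (see below)
    )
)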

Solr Query Parameters Explained

Here’s what gets sent to Solr:

data_params = {
    # Stage 1: Initial keyword retrieval
    "q": query_string,                    # Your keyword query (e.g., "deploy nodejs")
    "defType": "edismax",                 # Extended DisMax parser (flexible keyword matching)
    "rows": k,                            # Final result count (but we'll rerank k*2 first)

    # Stage 2: Reranking configuration
    "rq": f"{{!rerank reRankQuery=$rqq reRankDocs={k*2} reRankWeight={vector_boost}}}",
    # rq = rerank query instruction
    # reRankQuery=$rqq = use the query defined in the rqq parameter
    # reRankDocs={k*2} = rerank the top k*2 documents from keyword search
    # reRankWeight={vector_boost} = how much to weight semantic score vs keyword score

    "rqq": f"{{!knn f={vector_field} topK={k*2}}}{vector_str}",
    # rqq = the actual rerank query (KNN = K-Nearest Neighbors, a vector similarity search)
    # f={vector_field} = which field contains the document vectors
    # topK={k*2} = consider top k*2 candidates
    # {vector_str} = the query vector as a string

    # Other parameters
    "fl": "*, score",                     # Return all fields + relevance score
    "fq": ["product:*openshift*"],        # Filter query (only OpenShift docs)
    "wt": "json",                         # Response format (JSON)
}
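
To see what this looks like as an actual HTTP request, here is a sketch that fills in concrete values and sends them to Solr's standard /select handler with requests. The core name, vector field name, and the tiny three-dimensional vector are made-up stand-ins; in practice the vector must match the dimension of the indexed vector field:

import requests

k = 10
vector_boost = 6.0
vector_str = "[0.12, -0.03, 0.44]"   # toy vector; real ones match the indexed dimension
solr_select = "http://localhost:8983/solr/docs/select"   # assumed host and core name

params = {
    "q": "deploy nodejs openshift",
    "defType": "edismax",
    "rows": k,
    "rq": f"{{!rerank reRankQuery=$rqq reRankDocs={k * 2} reRankWeight={vector_boost}}}",
    "rqq": f"{{!knn f=chunk_vector topK={k * 2}}}{vector_str}",   # assumed vector field name
    "fl": "*, score",
    "fq": ["product:*openshift*"],
    "wt": "json",
}

response = requests.get(solr_select, params=params)
for doc in response.json()["response"]["docs"]:
    print(doc["score"], doc.get("id"))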

Understanding the Key Parameters

| Parameter | What It Does | Example Value | Why It Matters |
|---|---|---|---|
| q | The keyword search query | "deploy nodejs openshift" | Finds initial candidates based on word matches |
| rq | Rerank instruction | "{!rerank ...}" | Tells Solr to rerank results |
| reRankDocs | How many docs to rerank | 20 (if k=10) | Larger pool = better reranking, but slower |
| reRankQuery | What to use for reranking | $rqq (references rqq param) | Points to the vector similarity query |
| reRankWeight | Semantic score importance | 5.0 (medium) | Controls balance: low = keyword wins, high = semantic wins |
| rqq | The vector similarity query | "{!knn f=vector topK=20}..." | Performs semantic search on candidates |

How reRankWeight Works

The reRankWeight parameter is crucial. It controls how the final score is calculated:

Important: Score Formula
final_score = keyword_score + (reRankWeight × semantic_score)

Examples:

  • reRankWeight = 1: Semantic score has equal weight to keyword score
  • reRankWeight = 5: Semantic score is 5× more important (balanced approach)
  • reRankWeight = 20: Semantic score dominates (for conceptual queries)

Why this matters: Different types of queries need different balances. A query like “CVE-2024-1234” needs exact keyword matching (low weight), while “how to improve security” benefits from semantic understanding (high weight).
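
A tiny worked example with made-up scores shows the effect. The "exact wording" document wins on keyword score, the "synonym" document wins on semantic score, and the weight decides which one surfaces first:

# final = keyword + weight * semantic (made-up scores for illustration)
docs = [
    {"id": "exact-wording-doc", "keyword": 9.0, "semantic": 0.55},
    {"id": "synonym-doc",       "keyword": 6.0, "semantic": 0.90},
]

for weight in (1.0, 5.0, 20.0):
    ranked = sorted(docs, key=lambda d: d["keyword"] + weight * d["semantic"], reverse=True)
    print(f"reRankWeight={weight}: " + " > ".join(d["id"] for d in ranked))

# reRankWeight=1.0: exact-wording-doc > synonym-doc
# reRankWeight=5.0: exact-wording-doc > synonym-doc
# reRankWeight=20.0: synonym-doc > exact-wording-doc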

Choosing the Right Reranking Strategy

One of the key insights from this implementation is that different query types need different reranking strategies. You can’t use the same reRankWeight for everything.

Why One Size Doesn’t Fit All

Consider these three queries:

  1. "CVE-2024-1234" - You want the exact security advisory
  2. "how to improve application performance" - You want conceptually relevant guides
  3. "how to patch CVE-2024-1234" - You need both the exact CVE and conceptual guidance

Each needs a different balance between keyword matching and semantic understanding.

Note: Quick Strategy Reference

| Strategy | reRankWeight | Best For | Example Queries |
|---|---|---|---|
| Exact Match | 1-2 | Specific IDs, codes, commands | "CVE-2024-1234", "error 404" |
| Balanced | 5-8 | Tech + action combinations | "deploy nodejs on k8s" |
| Semantic Heavy | 15-20 | Concepts, how-to, best practices | "how to improve performance" |

Strategy 1: Exact Technical Queries (Low Semantic Weight)

When to use: Queries that require precise keyword matching

Examples:

  • "CVE-2024-1234" - Specific security advisory ID
  • "error code 404" - Exact error code
  • "kubectl get pods" - Specific command syntax
  • "API endpoint /v1/users" - Exact API path

Strategy: Low reRankWeight (1-2)

Why:

  • These queries have very specific, unambiguous intent
  • Exact keyword matches are more important than semantic similarity
  • You don’t want semantic search to “helpfully” return similar but different CVEs or error codes
  • The keyword search already finds the right documents; reranking should only make minor adjustments

Example scenario:

Query: "CVE-2024-1234"
Keyword search finds: Document about CVE-2024-1234 (score: 10.0)
                      Document about CVE-2024-1235 (score: 8.0)  # Similar but wrong!
                      
With low reRankWeight (1.0):
- CVE-2024-1234 stays on top (keyword score dominates)
- CVE-2024-1235 stays lower (even if semantically similar)

With high reRankWeight (20.0):
- Risk: CVE-2024-1235 might jump ahead if it's semantically similar
- Problem: User gets wrong CVE!

Strategy 2: Conceptual Queries (High Semantic Weight)

When to use: Queries about concepts, best practices, or “how-to” questions

Examples:

  • "how to improve performance" - Broad conceptual question
  • "best practices for security" - General guidance
  • "troubleshooting slow deployments" - Problem-solving query
  • "scaling applications" - Conceptual topic

Strategy: High reRankWeight (15-20)

Why:

  • These queries are about concepts, not exact terms
  • Users might use different words than the documentation
  • Semantic understanding helps find relevant content even if terminology differs
  • Keyword search might miss relevant docs that use synonyms or related terms

Example scenario:

Query: "how to improve performance"
Keyword search finds: Doc mentioning "improve performance" (score: 9.0)
                      Doc about "optimization techniques" (score: 6.0)  # Relevant but different words!
                      
With low reRankWeight (1.0):
- "improve performance" doc stays on top
- "optimization techniques" stays lower (missed opportunity)

With high reRankWeight (20.0):
- "optimization techniques" jumps ahead (semantically very relevant)
- User gets better results!

Strategy 3: Mixed Queries (Balanced Weight)

When to use: Queries that combine specific terms with conceptual needs

Examples:

  • "how to patch CVE-2024-1234" - Specific CVE + general patching guidance
  • "deploy nodejs on kubernetes" - Specific technologies + deployment concept
  • "troubleshoot openshift authentication errors" - Specific product + general troubleshooting
  • "configure SSL for nginx" - Specific tech + configuration concept

Strategy: Medium reRankWeight (5-8)

Why:

  • Need to match specific keywords (technology names, product names, error codes)
  • But also benefit from semantic understanding of the action/concept
  • Balance ensures specific terms are matched while still finding conceptually relevant content

Example scenario:

Query: "deploy nodejs on kubernetes"
Keyword search finds: "Deploying Node.js on Kubernetes" (score: 10.0)
                      "Running Node.js apps in K8s" (score: 7.0)  # Different words, same concept
                      
With medium reRankWeight (6.0):
- Both documents are considered
- Exact match stays high, but semantic match can surface if very relevant
- Good balance between precision and recall

Decision Framework

Note: Quick Reference for Choosing reRankWeight

When choosing reRankWeight, ask yourself:

  1. Is this query about a specific, unambiguous thing? (CVE, error code, exact command)
    • → Use low weight (1-2)
  2. Is this query about a concept or general topic? (how-to, best practices, troubleshooting)
    • → Use high weight (15-20)
  3. Does it combine specific terms with concepts? (specific tech + general action)
    • → Use medium weight (5-8)

Putting It All Together: A Practical Implementation Plan

Now that we understand the concepts, let’s see how to implement this in practice.

The Three-Tier Classification System

Instead of trying to pick the perfect reRankWeight for every query, we can classify queries into three tiers:

| Tier | Query Characteristics | reRankWeight | When to Use |
|---|---|---|---|
| Exact Match Critical | Security IDs (CVE, Errata), error codes, exact commands | 1-2 | Queries that must match exact keywords |
| Balanced | Technology + action combinations, mixed queries | 5-8 | Default for most queries (covers majority of cases) |
| Semantic Heavy | Questions, how-to guides, best practices, troubleshooting | 15-20 | Conceptual queries where meaning matters most |

How Classification Works

Example Classification Logic:

import re

def classify_query(query: str) -> str:
    # Exact match critical: CVE, Errata, specific error codes
    if re.search(r'CVE-\d{4}-\d+', query) or 'errata' in query.lower():
        return "exact_match"

    # Semantic heavy: questions, how-to, best practices
    if query.lower().startswith(('how', 'what', 'why', 'when')) or \
       'best practice' in query.lower() or 'troubleshoot' in query.lower():
        return "semantic_heavy"

    # Default: balanced
    return "balanced"

# Map to reRankWeight
weight_map = {
    "exact_match": 1.5,
    "balanced": 6.0,
    "semantic_heavy": 18.0
}
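
Running the classifier over the three example queries from earlier shows each landing in its intended tier:

for q in [
    "CVE-2024-1234",
    "how to improve application performance",
    "deploy nodejs on kubernetes",
]:
    tier = classify_query(q)
    print(f"{q} -> {tier} (reRankWeight={weight_map[tier]})")

# CVE-2024-1234 -> exact_match (reRankWeight=1.5)
# how to improve application performance -> semantic_heavy (reRankWeight=18.0)
# deploy nodejs on kubernetes -> balanced (reRankWeight=6.0)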

Implementation Steps

  1. Extend intent detection to classify queries into three tiers
    • Use pattern matching (regex, keywords)
    • Leverage existing intent detection if available
    • Start simple, refine based on data
  2. Map each tier to reRankWeight value
    • Start with suggested ranges (1-2, 5-8, 15-20)
    • Fine-tune based on your specific use case
  3. Test on historical query logs (see the sketch after this list)
    • Run queries through both old and new systems
    • Compare result quality (relevance, user satisfaction)
    • Measure performance impact
  4. Monitor and iterate
    • Track which queries get which classification
    • Collect user feedback on result quality
    • Adjust weights and classification rules based on data
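
For step 3, a rough offline-comparison sketch: replay logged queries at the old and new weights and flag the queries whose top-k results changed the most, so they can be reviewed by hand. run_search is a hypothetical helper that wraps the Solr call and returns a list of document IDs:

def topk_overlap(ids_a, ids_b):
    """Jaccard overlap between two top-k result lists (1.0 = identical sets)."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def compare_weights(logged_queries, old_weight, new_weight, k=10):
    changed = []
    for query in logged_queries:
        old_ids = run_search(query, rerank_weight=old_weight, k=k)   # hypothetical helper
        new_ids = run_search(query, rerank_weight=new_weight, k=k)   # hypothetical helper
        overlap = topk_overlap(old_ids, new_ids)
        if overlap < 1.0:
            changed.append((query, overlap))
    # Lowest overlap first: these queries moved the most and deserve manual review
    return sorted(changed, key=lambda pair: pair[1])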

Advantages of This Approach

Tip: Why Three Tiers Work
  • Practical starting point: Three tiers cover most use cases without being too complex
  • Data-driven refinement: Start with defaults, improve based on real queries
  • Explainable: Easy to understand why a query got a certain weight
  • Extensible: Can add more tiers or dynamic weights later

Alternative Approaches (For Future Learning)

This implementation uses keyword-first reranking, but there are other hybrid search strategies:

  1. Union-based: Run keyword and vector search separately, merge results
  2. RRF (Reciprocal Rank Fusion): Combine rankings from multiple search methods (see the sketch after this list)
  3. Learning to Rank (LTR): Use machine learning to automatically optimize weights
  4. Dynamic weights: Adjust reRankWeight based on query features (length, term frequency, etc.)
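
As a taste of option 2, Reciprocal Rank Fusion ignores raw scores entirely and sums 1/(k0 + rank) for each document across the ranked lists it appears in (k0 is a damping constant, commonly 60, unrelated to the k result count used above). A minimal sketch:

from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k0=60):
    """Merge several ranked lists of doc IDs; each doc scores sum(1 / (k0 + rank))."""
    scores = defaultdict(float)
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):   # ranks are 1-based
            scores[doc_id] += 1.0 / (k0 + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # from lexical search
vector_hits = ["doc_c", "doc_a", "doc_d"]    # from vector search
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['doc_a', 'doc_c', 'doc_b', 'doc_d']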

Key Takeaways

Important: Core Concepts
  1. Reranking is a two-stage process: Keyword search finds candidates, semantic search refines the ranking

  2. reRankWeight controls the balance: It determines how much semantic similarity matters vs. keyword matching

    • Low (1-2): Keyword matching dominates
    • Medium (5-8): Balanced approach
    • High (15-20): Semantic similarity dominates
  3. Different query types need different strategies:

    • Exact technical queries → Low weight
    • Conceptual queries → High weight
    • Mixed queries → Medium weight
  4. Start simple, iterate based on data:

    • Three tiers is a practical starting point
    • Refine weights and classification rules based on real query performance
  5. This is keyword-first hybrid:

    • Only documents matching keywords are considered
    • Reranking refines within that set
    • This is different from union-based approaches that merge separate results
  6. Why retrieve k*2 candidates?

    • Gives reranker a larger pool to work with
    • Allows semantically relevant docs to be “rescued” from lower keyword ranks
    • Balance between quality and performance

Next Learning Topics

  • Experiment design: How to systematically test reranking strategies with query logs
  • Alternative hybrid approaches: Union-based search, RRF (Reciprocal Rank Fusion)
  • Dynamic reRankWeight: Adjusting weights based on query features automatically
  • Learning to Rank (LTR): Using machine learning to optimize reranking weights
  • Performance optimization: Balancing reranking quality with query latency

Reference Materials