Daniel Gray


How the Search Engine Works

This blog features a sophisticated semantic search system that goes beyond simple keyword matching. The search engine uses vector embeddings, intelligent ranking, and advanced query parsing to help readers find relevant content quickly and accurately.

This is part 8 of the Blog Architecture Deep Dive series. Start with Blog Architecture - Overview if you haven't read it yet.

Overview

The search system combines multiple techniques to provide accurate, fast search results:

  1. Vector Embeddings: Converts content into mathematical vectors for semantic similarity
  2. Intelligent Ranking: Boosts results based on where matches occur (title, excerpt, category)
  3. Query Parsing: Supports quoted phrases, excluded words, and boolean operators
  4. Real-time Search: Debounced search with keyboard navigation
  5. Result Highlighting: Visual feedback showing where matches were found

Architecture

Build-Time Embedding Generation

Search embeddings are generated during the build process (npm run generate-embeddings). This ensures fast search performance at runtime—no API calls needed for each search.

The embedding generation process:

  1. Content Extraction: Reads all published markdown files

  2. Searchable Text Creation: Combines title, excerpt, and categories into searchable text

  3. Vector Generation: Creates vector representations using one of two methods:

    • Simple Feature Vectors (default): Zero-cost, text-based vectors using word frequency
    • OpenAI Embeddings (optional): Semantic embeddings using OpenAI's API for better understanding
  4. Caching: Saves embeddings to lib/embeddings.json for fast loading
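The word-frequency approach behind the simple feature vectors can be sketched as follows. This is an illustrative reconstruction, not the actual code in scripts/generate-embeddings.js: build a shared vocabulary across all posts, then count each post's word occurrences against it.

```typescript
// Build a sorted vocabulary from all searchable texts.
function buildVocabulary(texts: string[]): string[] {
  const vocab = new Set<string>();
  for (const text of texts) {
    for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      vocab.add(word);
    }
  }
  return [...vocab].sort();
}

// Map a text onto that vocabulary as a word-frequency vector.
function toFrequencyVector(text: string, vocab: string[]): number[] {
  const counts = new Map<string, number>();
  for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return vocab.map((word) => counts.get(word) ?? 0);
}
```

Because every post shares the same vocabulary, any two vectors are directly comparable with cosine similarity.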

Embedding Data Structure

Each embedding contains:

{
  slug: string;           // URL slug for the post
  title: string;          // Post title
  excerpt: string;        // Post excerpt
  categories: string[];   // Post categories
  searchableText: string; // Combined searchable text
  embedding?: number[];   // OpenAI embedding vector (if available)
  vector?: number[];      // Simple feature vector
  features?: string[];    // Feature names for simple vectors
}

Search Process

1. Query Parsing

The search engine parses queries to extract:

  • Quoted Phrases: Exact phrase matches using "quantum computing"
  • Search Words: Individual words (stop words filtered out)
  • Excluded Words: Words prefixed with - to exclude results

Example queries:

  • "quantum computing" - Finds exact phrase
  • fusion -nuclear - Finds fusion but excludes nuclear
  • planetary science mars - Finds posts matching all words
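These parsing rules can be sketched in a few lines. The interface name, stop-word list, and regexes below are illustrative; the actual parser lives in lib/vector-search.ts:

```typescript
interface ParsedQuery {
  phrases: string[];   // quoted "exact phrase" matches
  words: string[];     // individual search words (stop words removed)
  excluded: string[];  // words prefixed with "-"
}

// A tiny illustrative stop-word list; the real one is presumably longer.
const STOP_WORDS = new Set(['a', 'an', 'the', 'and', 'or', 'of', 'to']);

function parseQuery(query: string): ParsedQuery {
  const phrases: string[] = [];
  // Pull out quoted phrases first, removing them from the remaining query.
  const rest = query.replace(/"([^"]+)"/g, (_match: string, phrase: string) => {
    phrases.push(phrase.toLowerCase());
    return ' ';
  });

  const words: string[] = [];
  const excluded: string[] = [];
  for (const token of rest.toLowerCase().split(/\s+/).filter(Boolean)) {
    if (token.startsWith('-') && token.length > 1) {
      excluded.push(token.slice(1));
    } else if (!STOP_WORDS.has(token)) {
      words.push(token);
    }
  }
  return { phrases, words, excluded };
}
```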

Interactive Demo: Query Parsing

Type different queries to see how they're parsed. Try:

  • "exact phrase" - Creates a phrase match
  • word1 word2 -exclude - Extracts words and exclusions
  • "multiple phrases" and words -excluded - Combines all features

2. Vector Similarity Calculation

For each post, the system calculates similarity using cosine similarity:

cosineSimilarity(queryVector, postVector) = 
  dotProduct(queryVector, postVector) / 
  (norm(queryVector) * norm(postVector))

This measures the angle between vectors—smaller angles mean more similar content.
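In code, the formula above is a few lines; this sketch adds a guard for zero vectors, which would otherwise divide by zero:

```typescript
function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, value, i) => sum + value * (b[i] ?? 0), 0);
}

function norm(v: number[]): number {
  return Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
}

// Cosine similarity: 1 for identical directions, 0 for orthogonal vectors.
function cosineSimilarity(queryVector: number[], postVector: number[]): number {
  const denominator = norm(queryVector) * norm(postVector);
  return denominator === 0 ? 0 : dotProduct(queryVector, postVector) / denominator;
}
```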

Interactive Demo: Cosine Similarity

Try changing the query and document text to see how the similarity score changes. The visualization shows:

  • Word frequency vectors for both query and document
  • The calculated cosine similarity score
  • The angle between vectors (smaller angle = higher similarity)

3. Intelligent Ranking

The base similarity score is boosted based on where matches occur:

  • Title Match: +0.5 for exact title match, +0.3 for partial
  • Exact Phrase Match: +0.4 in title, +0.3 in excerpt, +0.2 in content
  • Word Matches: +0.2 per word in title, +0.15 in categories
  • All Words Match: +0.2 bonus when all query words are found

This ensures that posts with matches in titles or categories rank higher than those with matches only in content.
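A simplified sketch of this boosting pass, using the word-match constants listed above (the phrase and exact-title boosts are omitted for brevity, and the function and field names are illustrative):

```typescript
interface PostMeta {
  title: string;
  excerpt: string;
  categories: string[];
}

// Boost the base cosine-similarity score based on where query words match.
function boostScore(baseScore: number, words: string[], post: PostMeta): number {
  let score = baseScore;
  const title = post.title.toLowerCase();
  const excerpt = post.excerpt.toLowerCase();
  const cats = post.categories.map((c) => c.toLowerCase());

  let matched = 0;
  for (const word of words) {
    if (title.includes(word)) score += 0.2;                // word match in title
    if (cats.some((c) => c.includes(word))) score += 0.15; // word match in categories
    if (title.includes(word) || excerpt.includes(word)) matched += 1;
  }
  if (words.length > 0 && matched === words.length) score += 0.2; // all words found
  return score;
}
```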

Interactive Demo: Ranking & Boosting

Try different search queries to see how documents are re-ranked based on:

  • Where matches occur (title vs excerpt vs category)
  • How many query words match
  • The base similarity score

Documents with matches in titles rank higher, even if their base similarity score is lower.

4. Filtering

Results can be filtered by:

  • Category: Filter to specific categories
  • Minimum Score: Set a relevance threshold
  • Excluded Words: Automatically filter out posts containing excluded terms
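These three filters compose naturally into a single pass over the ranked results. A hypothetical sketch (the type and function names are illustrative):

```typescript
interface RankedResult {
  slug: string;
  score: number;
  categories: string[];
  searchableText: string;
}

// Apply category, minimum-score, and excluded-word filters in one pass.
function applyFilters(
  results: RankedResult[],
  opts: { category?: string; minScore?: number; excluded?: string[] },
): RankedResult[] {
  const { category, minScore = 0, excluded = [] } = opts;
  return results.filter((r) => {
    if (category && !r.categories.includes(category)) return false;
    if (r.score < minScore) return false;
    const text = r.searchableText.toLowerCase();
    return !excluded.some((word) => text.includes(word.toLowerCase()));
  });
}
```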

Search API

The search API (/api/search) accepts:

POST /api/search
{
  query: string;        // Search query
  category?: string;    // Optional category filter
  limit?: number;       // Max results (default: 5)
  minScore?: number;    // Minimum relevance score
}

Returns:

{
  results: Array<{
    slug: string;
    title: string;
    excerpt: string;
    score: number;
    matchType: 'title' | 'excerpt' | 'category' | 'content';
  }>;
  query: string;
  totalResults: number;
}
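A small client-side helper for calling this endpoint might look like the following. The helper name is made up; only the request and response shapes come from the spec above:

```typescript
// Build the fetch options for POST /api/search.
function buildSearchRequest(query: string, category?: string, limit = 5) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query, ...(category ? { category } : {}), limit }),
  };
}

// Usage:
// fetch('/api/search', buildSearchRequest('fusion -nuclear'))
//   .then((res) => res.json())
//   .then(({ results, totalResults }) => console.log(totalResults, results));
```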

User Interface Features

Real-Time Search

The search input uses debouncing (300ms) to avoid excessive API calls while typing. Results update automatically as you type.
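The debounce itself is a small generic utility; a sketch of the 300ms behaviour described above (in the real UI this presumably lives inside the HeaderSearch component, perhaps as a React hook):

```typescript
// Return a wrapper that delays fn until delayMs of inactivity has passed,
// resetting the timer on every call.
function debounce<T extends unknown[]>(fn: (...args: T) => void, delayMs = 300) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: T) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}
```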

Keyboard Navigation

  • Arrow Down/Up: Navigate through results
  • Enter: Open selected result
  • Escape: Close search and clear query

Visual Feedback

  • Highlighting: Search terms are highlighted in results using <mark> tags
  • Match Type Indicators: Shows whether match was in title, excerpt, or category
  • Selected State: Visual outline shows currently selected result
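The highlighting can be done with a case-insensitive regex replacement that wraps matched terms in <mark> tags. A sketch; in a real React component the result would need to be rendered safely (e.g. by splitting into nodes rather than injecting raw HTML):

```typescript
// Wrap each search term in <mark> tags, preserving the original casing.
function highlight(text: string, terms: string[]): string {
  let out = text;
  for (const term of terms) {
    if (!term) continue;
    // Escape regex metacharacters so terms like "c++" match literally.
    const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    out = out.replace(new RegExp(`(${escaped})`, 'gi'), '<mark>$1</mark>');
  }
  return out;
}
```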

Search Tips

When no results are found, the UI displays helpful tips:

  • Using quotes for exact phrases
  • Excluding words with minus
  • Trying broader search terms

Embedding Options

Simple Feature Vectors (Default)

Pros:

  • Zero cost (no API calls)
  • Works offline
  • Fast generation
  • Good for keyword-based searches

Cons:

  • Less semantic understanding
  • May miss synonyms
  • Limited context awareness

OpenAI Embeddings (Optional)

Pros:

  • Better semantic understanding
  • Handles synonyms and related concepts
  • More context-aware
  • Better for natural language queries

Cons:

  • Requires API key
  • Has costs (though minimal with text-embedding-3-small)
  • Requires network during build

To enable OpenAI embeddings, set OPENAI_API_KEY environment variable during build.

Performance Optimizations

  1. Build-Time Generation: Embeddings created once at build time, not per-request
  2. Cached Loading: Embeddings loaded from JSON file (fast file I/O)
  3. Debounced Queries: Reduces API calls while typing
  4. Limited Results: Default limit of 5-8 results keeps response small
  5. Early Filtering: Filters applied before expensive similarity calculations

Example Searches

Exact Phrase

"quantum computing"

Finds posts containing the exact phrase "quantum computing".

Excluded Terms

fusion -nuclear

Finds posts about fusion but excludes those mentioning nuclear.

Category Search

planetary science

Finds posts matching "planetary science" with higher ranking for posts in that category.

Natural Language

articles about 3D graphics

Uses semantic understanding to find posts about 3D graphics, even if they don't contain those exact words.

Future Enhancements

Potential improvements for the search system:

  1. Full-Text Search: Search within full article content, not just metadata
  2. Fuzzy Matching: Handle typos and variations
  3. Search History: Remember recent searches
  4. Search Analytics: Track popular searches
  5. Autocomplete/Suggestions: Show suggestions as user types
  6. Date Range Filtering: Filter by publication date
  7. Series Filtering: Filter to specific series
  8. Advanced Boolean Operators: Support AND, OR, NOT operators

Technical Implementation

Key Files

  • lib/vector-search.ts: Core search logic with ranking and query parsing
  • app/api/search/route.ts: Search API endpoint
  • components/core/HeaderSearch.tsx: Search UI component
  • scripts/generate-embeddings.js: Build-time embedding generation

Search Flow

  1. User types query
  2. Debounce (300ms)
  3. Parse query (phrases, words, exclusions)
  4. Generate query vector
  5. Calculate cosine similarity for all posts
  6. Apply boosting (title, category, phrases)
  7. Filter by category/exclusions
  8. Sort by score
  9. Return top N results
  10. Highlight matches in UI

Conclusion

The search engine combines vector similarity, intelligent ranking, and advanced query parsing to provide fast, accurate search results. By generating embeddings at build time, the system achieves excellent performance without requiring external APIs at runtime (though OpenAI embeddings are available as an optional enhancement).

The system is designed to be:

  • Fast: Cached embeddings, debounced queries, limited results
  • Accurate: Semantic understanding with intelligent ranking
  • User-Friendly: Keyboard navigation, highlighting, helpful tips
  • Flexible: Supports multiple embedding methods and query types

For more details on the blog architecture, see the other articles in the Blog Architecture Deep Dive - Series Index series.

Related Content

Blog Architecture - Overview

Blog Architecture - Overview This is the first in a series of articles exploring the architecture of this blog. We'll start with a high-level overview that anyone can understand, then progressively di...

Blog Architecture Deep Dive - Series Index

Blog Architecture Deep Dive - Series Index This page serves as an index for the Blog Architecture Deep Dive series: a comprehensive exploration of how this blog is built, from high-level concepts to d...