Skip to main content

Full Text Search with Typo Tolerance

Overview

This document explains how we implemented fuzzy search (typo tolerance) in the listings_jobs table using PostgreSQL's trigram similarity feature combined with traditional full text search.

The Problem

Standard PostgreSQL full text search (tsearch) does not handle typos. If a user searches for "casier" instead of "cashier", tsearch returns no results because it works with exact word stems and tokens.

Example:

  • Search: "casier"
  • Expected: Jobs with "cashier" in title/description
  • Actual (with tsearch only): No results ❌

How Full Text Search Works (First Principles)

1. Traditional Text Search (LIKE operator)

The simplest approach to text search uses SQL's LIKE operator:

SELECT * FROM listings_jobs WHERE title LIKE '%cashier%';

Problems:

  • Very slow on large tables (requires full table scan)
  • No relevance ranking
  • No typo tolerance
  • Case-sensitive unless you use ILIKE

2. Full Text Search (tsearch)

PostgreSQL's full text search tokenizes and stems words, then stores them in a tsvector column:

-- Original text: "Looking for experienced cashier"
-- Tokenized and stemmed: 'cashier':4 'experienc':3 'look':1

How it works:

  1. Text is broken into tokens (words)
  2. Tokens are stemmed (e.g., "running" → "run", "cashiers" → "cashier")
  3. Tokens are stored in a tsvector column with positions
  4. A GIN index is created on the tsvector for fast lookup
  5. Searches use @@ operator to match against the tsvector

Benefits:

  • ⚡ Very fast (uses GIN index)
  • 📊 Relevance ranking built-in
  • 🔤 Language-aware stemming (handles plurals, tenses)
  • 🔍 Prefix matching

Limitations:

  • ❌ No typo tolerance
  • ❌ Exact stem matches only

3. Trigram Similarity (pg_trgm)

Trigrams break text into overlapping 3-character sequences to measure similarity:

"cashier" → {" ca", "cas", "ash", "shi", "hie", "ier", "er "}
"casier" → {" ca", "cas", "asi", "sie", "ier", "er "}

How similarity is calculated:

Similarity = (common trigrams) / (total unique trigrams)

Common trigrams between "cashier" and "casier":
{" ca", "cas", "ier", "er "} = 4 trigrams

Total unique trigrams: 9

Similarity = 4/9 ≈ 0.44

Benefits:

  • ✅ Handles typos and misspellings
  • ✅ Works with fuzzy matches
  • ✅ Phonetic similarity
  • ✅ Works on any text column

Limitations:

  • 🐌 Slower than tsearch (but still reasonably fast with GIN indexes)
  • 📊 Different ranking algorithm

Our Implementation Strategy

We combine both tsearch and trigram to get the best of both worlds:

pg_search_scope :full_text_search,
against: [
:company_name,
:category_name,
:title,
:description,
:description_summary,
:requirements_raw,
:benefits_raw
],
using: {
tsearch: {
dictionary: 'english',
tsvector_column: :search_vector,
prefix: true
},
trigram: {
threshold: 0.2,
word_similarity: true
}
}

How It Works Together

When you call Listings::Job.full_text_search("casier"):

  1. tsearch looks for exact stem matches in the search_vector column

    • Fast lookup via GIN index
    • Returns results with exact matches (if any)
  2. trigram calculates similarity against all specified columns

    • Finds "cashier" because similarity to "casier" is 0.44 (above 0.2 threshold)
    • Returns fuzzy matches
  3. pg_search combines and ranks both result sets

    • Exact matches get higher scores
    • Fuzzy matches get lower scores (but still included)
    • Results are sorted by relevance

Configuration Parameters Explained

threshold: 0.2

This controls how similar strings need to be to match:

  • 0.1 = very lenient (matches almost anything)
  • 0.2 = balanced (recommended starting point)
  • 0.3 = moderate
  • 0.5 = strict

Example comparisons:

  • "cashier" vs "casier" = 0.44 ✅ (matches at 0.2 threshold)
  • "cashier" vs "waiter" = 0.09 ❌ (doesn't match at 0.2 threshold)

word_similarity: true

This uses PostgreSQL's word_similarity() function instead of similarity():

  • similarity(): Compares entire strings

    • "full time cashier" vs "cashier" = lower score
  • word_similarity(): Finds best matching word

    • "full time cashier" vs "cashier" = higher score (matches the word "cashier")

Recommendation: Use word_similarity: true for multi-word searches.

against: [:column_names]

Specifies which columns trigram should search. This is required because trigram works on actual column text, not on the pre-computed search_vector.

Database Schema Changes

Migration Created

class EnableExtensionPgTrgm < ActiveRecord::Migration[8.0]
def change
enable_extension "pg_trgm" unless extension_enabled?("pg_trgm")
end
end

This enables the PostgreSQL trigram extension, which provides:

  • similarity() function
  • word_similarity() function
  • % operator for similarity matching
  • GIN/GIST index support for trigrams

Existing Index

We already have a GIN index on the search_vector column for fast tsearch:

t.index ["search_vector"], name: "index_listings_jobs_on_search_vector", using: :gin

Note: You can optionally add trigram indexes for better performance:

add_index :listings_jobs, :title, using: :gin, opclass: :gin_trgm_ops

However, this is usually not necessary unless you have millions of rows.

Testing the Implementation

1. Run the Migration

bundle exec rails db:migrate

2. Test in Rails Console

# Search with exact match (tsearch handles this)
Listings::Job.full_text_search("cashier").count

# Search with typo (trigram handles this)
Listings::Job.full_text_search("casier").count

# Should return same results (or very similar)

3. Test Different Typo Severities

# Mild typo (1 character different)
Listings::Job.full_text_search("casier")

# Moderate typo (2 characters different)
Listings::Job.full_text_search("casher")

# Severe typo (might not match depending on threshold)
Listings::Job.full_text_search("cascr")

4. Check Similarity Scores

require 'pg_search'

# Calculate similarity manually in PostgreSQL
ActiveRecord::Base.connection.execute(
"SELECT similarity('cashier', 'casier') as score"
).first
# => {"score"=>"0.444444"}

ActiveRecord::Base.connection.execute(
"SELECT similarity('cashier', 'cascr') as score"
).first
# => {"score"=>"0.222222"}

Performance Considerations

Speed Comparison

MethodSpeedUse Case
tsearch🚀 FastestExact matches, large datasets
trigram🏃 FastFuzzy matches, typo tolerance
LIKE🐌 SlowSmall datasets, simple queries

Optimization Tips

  1. Keep both tsearch and trigram for best results

    • tsearch handles 90% of queries (exact matches)
    • trigram catches the remaining 10% (typos)
  2. Adjust threshold based on your needs

    • Start with 0.2
    • Monitor false positives (irrelevant results)
    • Increase threshold if too many false positives
    • Decrease threshold if missing relevant results
  3. Use selective against columns

    • Only include columns users actually search
    • More columns = slower trigram search
    • Title and description are usually sufficient
  4. Monitor query performance

    # Check slow queries
    Listings::Job.full_text_search("casier").explain

Tuning the Threshold

Start with threshold: 0.2 and adjust based on testing:

Too Many False Positives?

Increase threshold to 0.3 or 0.4:

trigram: {
threshold: 0.3, # More strict
word_similarity: true
}

Missing Relevant Results?

Decrease threshold to 0.15:

trigram: {
threshold: 0.15, # More lenient
word_similarity: true
}

Test Different Thresholds

# Create a helper method to test
def test_threshold(query, threshold)
Listings::Job.full_text_search(query).reorder(nil).select do |job|
similarity = ActiveRecord::Base.connection.execute(
"SELECT similarity('#{query}', '#{job.title}')"
).first['similarity'].to_f

similarity >= threshold
end
end

# Test different thresholds
test_threshold("casier", 0.2).count # More results
test_threshold("casier", 0.3).count # Fewer results
test_threshold("casier", 0.4).count # Even fewer

Common Issues and Solutions

Issue: Slow Queries

Solution: Add trigram indexes on frequently searched columns:

class AddTrigramIndexes < ActiveRecord::Migration[8.0]
def change
add_index :listings_jobs, :title,
using: :gin,
opclass: :gin_trgm_ops,
name: 'index_listings_jobs_on_title_trgm'
end
end

Issue: Too Many False Positives

Solution: Increase the threshold or use word_similarity instead of regular similarity:

trigram: {
threshold: 0.3,
word_similarity: true # Better for multi-word fields
}

Issue: Not Finding Expected Results

Solutions:

  1. Check if the text exists in the against columns
  2. Lower the threshold
  3. Verify the extension is enabled: SELECT * FROM pg_extension WHERE extname = 'pg_trgm';
  4. Check the similarity score manually

Alternative Approaches Considered

1. Levenshtein Distance (fuzzystrmatch)

enable_extension 'fuzzystrmatch'

# Usage
SELECT * FROM listings_jobs
WHERE levenshtein(title, 'casier') < 3;

Why we didn't use it:

  • Slower than trigrams
  • Doesn't work well with multi-word fields
  • No built-in GIN index support

2. Double Metaphone (Phonetic)

using: {
dmetaphone: {}
}

Why we didn't use it:

  • Only useful for phonetic similarity ("Stephen" vs "Steven")
  • Doesn't handle typos well
  • English-language specific

3. ElasticSearch / OpenSearch

Why we didn't use it:

  • Adds infrastructure complexity
  • PostgreSQL trigrams are sufficient for our scale
  • Additional cost and maintenance
  • Data synchronization issues

When to Use Each Method

ScenarioMethodReason
Exact word matchestsearch onlyFastest
Typo tolerancetsearch + trigramBest balance
Phonetic matchingdmetaphoneSounds-alike words
Millions of rowsElasticSearchBetter scale
Simple substringLIKE/ILIKESimplest

References

Summary

What we implemented:

  • Enabled pg_trgm extension
  • Combined tsearch + trigram in pg_search_scope
  • Configured reasonable defaults (threshold: 0.2, word_similarity: true)

Benefits:

  • Fast exact matches via tsearch
  • Typo tolerance via trigram
  • Automatic result ranking
  • No additional infrastructure

Trade-offs:

  • Slightly slower than pure tsearch (but still fast)
  • Requires tuning threshold for your use case
  • More complex query execution

Next steps:

  1. Run the migration
  2. Test with real search queries
  3. Monitor performance
  4. Adjust threshold as needed
  5. Consider adding trigram indexes if queries are slow