Full Text Search with Typo Tolerance
Overview
This document explains how we implemented fuzzy search (typo tolerance) in the listings_jobs table using PostgreSQL's trigram similarity feature combined with traditional full text search.
The Problem
Standard PostgreSQL full text search (tsearch) does not handle typos. If a user searches for "casier" instead of "cashier", tsearch returns no results because it works with exact word stems and tokens.
Example:
- Search: "casier"
- Expected: Jobs with "cashier" in title/description
- Actual (with tsearch only): No results ❌
How Full Text Search Works (First Principles)
1. Traditional Text Search (LIKE operator)
The simplest approach to text search uses SQL's LIKE operator:
SELECT * FROM listings_jobs WHERE title LIKE '%cashier%';
Problems:
- Very slow on large tables (requires full table scan)
- No relevance ranking
- No typo tolerance
- Case-sensitive unless you use
ILIKE
2. Full Text Search (tsearch)
PostgreSQL's full text search tokenizes and stems words, then stores them in a tsvector column:
-- Original text: "Looking for experienced cashier"
-- Tokenized and stemmed: 'cashier':4 'experienc':3 'look':1
How it works:
- Text is broken into tokens (words)
- Tokens are stemmed (e.g., "running" → "run", "cashiers" → "cashier")
- Tokens are stored in a
tsvectorcolumn with positions - A GIN index is created on the tsvector for fast lookup
- Searches use
@@operator to match against the tsvector
Benefits:
- ⚡ Very fast (uses GIN index)
- 📊 Relevance ranking built-in
- 🔤 Language-aware stemming (handles plurals, tenses)
- 🔍 Prefix matching
Limitations:
- ❌ No typo tolerance
- ❌ Exact stem matches only
3. Trigram Similarity (pg_trgm)
Trigrams break text into overlapping 3-character sequences to measure similarity:
"cashier" → {" ca", "cas", "ash", "shi", "hie", "ier", "er "}
"casier" → {" ca", "cas", "asi", "sie", "ier", "er "}
How similarity is calculated:
Similarity = (common trigrams) / (total unique trigrams)
Common trigrams between "cashier" and "casier":
{" ca", "cas", "ier", "er "} = 4 trigrams
Total unique trigrams: 9
Similarity = 4/9 ≈ 0.44
Benefits:
- ✅ Handles typos and misspellings
- ✅ Works with fuzzy matches
- ✅ Phonetic similarity
- ✅ Works on any text column
Limitations:
- 🐌 Slower than tsearch (but still reasonably fast with GIN indexes)
- 📊 Different ranking algorithm
Our Implementation Strategy
We combine both tsearch and trigram to get the best of both worlds:
pg_search_scope :full_text_search,
against: [
:company_name,
:category_name,
:title,
:description,
:description_summary,
:requirements_raw,
:benefits_raw
],
using: {
tsearch: {
dictionary: 'english',
tsvector_column: :search_vector,
prefix: true
},
trigram: {
threshold: 0.2,
word_similarity: true
}
}
How It Works Together
When you call Listings::Job.full_text_search("casier"):
-
tsearch looks for exact stem matches in the
search_vectorcolumn- Fast lookup via GIN index
- Returns results with exact matches (if any)
-
trigram calculates similarity against all specified columns
- Finds "cashier" because similarity to "casier" is 0.44 (above 0.2 threshold)
- Returns fuzzy matches
-
pg_search combines and ranks both result sets
- Exact matches get higher scores
- Fuzzy matches get lower scores (but still included)
- Results are sorted by relevance
Configuration Parameters Explained
threshold: 0.2
This controls how similar strings need to be to match:
0.1= very lenient (matches almost anything)0.2= balanced (recommended starting point)0.3= moderate0.5= strict
Example comparisons:
- "cashier" vs "casier" = 0.44 ✅ (matches at 0.2 threshold)
- "cashier" vs "waiter" = 0.09 ❌ (doesn't match at 0.2 threshold)
word_similarity: true
This uses PostgreSQL's word_similarity() function instead of similarity():
-
similarity(): Compares entire strings- "full time cashier" vs "cashier" = lower score
-
word_similarity(): Finds best matching word- "full time cashier" vs "cashier" = higher score (matches the word "cashier")
Recommendation: Use word_similarity: true for multi-word searches.
against: [:column_names]
Specifies which columns trigram should search. This is required because trigram works on actual column text, not on the pre-computed search_vector.
Database Schema Changes
Migration Created
class EnableExtensionPgTrgm < ActiveRecord::Migration[8.0]
def change
enable_extension "pg_trgm" unless extension_enabled?("pg_trgm")
end
end
This enables the PostgreSQL trigram extension, which provides:
similarity()functionword_similarity()function%operator for similarity matching- GIN/GIST index support for trigrams
Existing Index
We already have a GIN index on the search_vector column for fast tsearch:
t.index ["search_vector"], name: "index_listings_jobs_on_search_vector", using: :gin
Note: You can optionally add trigram indexes for better performance:
add_index :listings_jobs, :title, using: :gin, opclass: :gin_trgm_ops
However, this is usually not necessary unless you have millions of rows.
Testing the Implementation
1. Run the Migration
bundle exec rails db:migrate
2. Test in Rails Console
# Search with exact match (tsearch handles this)
Listings::Job.full_text_search("cashier").count
# Search with typo (trigram handles this)
Listings::Job.full_text_search("casier").count
# Should return same results (or very similar)
3. Test Different Typo Severities
# Mild typo (1 character different)
Listings::Job.full_text_search("casier")
# Moderate typo (2 characters different)
Listings::Job.full_text_search("casher")
# Severe typo (might not match depending on threshold)
Listings::Job.full_text_search("cascr")
4. Check Similarity Scores
require 'pg_search'
# Calculate similarity manually in PostgreSQL
ActiveRecord::Base.connection.execute(
"SELECT similarity('cashier', 'casier') as score"
).first
# => {"score"=>"0.444444"}
ActiveRecord::Base.connection.execute(
"SELECT similarity('cashier', 'cascr') as score"
).first
# => {"score"=>"0.222222"}
Performance Considerations
Speed Comparison
| Method | Speed | Use Case |
|---|---|---|
| tsearch | 🚀 Fastest | Exact matches, large datasets |
| trigram | 🏃 Fast | Fuzzy matches, typo tolerance |
| LIKE | 🐌 Slow | Small datasets, simple queries |
Optimization Tips
-
Keep both tsearch and trigram for best results
- tsearch handles 90% of queries (exact matches)
- trigram catches the remaining 10% (typos)
-
Adjust threshold based on your needs
- Start with 0.2
- Monitor false positives (irrelevant results)
- Increase threshold if too many false positives
- Decrease threshold if missing relevant results
-
Use selective
againstcolumns- Only include columns users actually search
- More columns = slower trigram search
- Title and description are usually sufficient
-
Monitor query performance
# Check slow queries
Listings::Job.full_text_search("casier").explain
Tuning the Threshold
Start with threshold: 0.2 and adjust based on testing:
Too Many False Positives?
Increase threshold to 0.3 or 0.4:
trigram: {
threshold: 0.3, # More strict
word_similarity: true
}
Missing Relevant Results?
Decrease threshold to 0.15:
trigram: {
threshold: 0.15, # More lenient
word_similarity: true
}
Test Different Thresholds
# Create a helper method to test
def test_threshold(query, threshold)
Listings::Job.full_text_search(query).reorder(nil).select do |job|
similarity = ActiveRecord::Base.connection.execute(
"SELECT similarity('#{query}', '#{job.title}')"
).first['similarity'].to_f
similarity >= threshold
end
end
# Test different thresholds
test_threshold("casier", 0.2).count # More results
test_threshold("casier", 0.3).count # Fewer results
test_threshold("casier", 0.4).count # Even fewer
Common Issues and Solutions
Issue: Slow Queries
Solution: Add trigram indexes on frequently searched columns:
class AddTrigramIndexes < ActiveRecord::Migration[8.0]
def change
add_index :listings_jobs, :title,
using: :gin,
opclass: :gin_trgm_ops,
name: 'index_listings_jobs_on_title_trgm'
end
end
Issue: Too Many False Positives
Solution: Increase the threshold or use word_similarity instead of regular similarity:
trigram: {
threshold: 0.3,
word_similarity: true # Better for multi-word fields
}
Issue: Not Finding Expected Results
Solutions:
- Check if the text exists in the
againstcolumns - Lower the threshold
- Verify the extension is enabled:
SELECT * FROM pg_extension WHERE extname = 'pg_trgm'; - Check the similarity score manually
Alternative Approaches Considered
1. Levenshtein Distance (fuzzystrmatch)
enable_extension 'fuzzystrmatch'
# Usage
SELECT * FROM listings_jobs
WHERE levenshtein(title, 'casier') < 3;
Why we didn't use it:
- Slower than trigrams
- Doesn't work well with multi-word fields
- No built-in GIN index support
2. Double Metaphone (Phonetic)
using: {
dmetaphone: {}
}
Why we didn't use it:
- Only useful for phonetic similarity ("Stephen" vs "Steven")
- Doesn't handle typos well
- English-language specific
3. ElasticSearch / OpenSearch
Why we didn't use it:
- Adds infrastructure complexity
- PostgreSQL trigrams are sufficient for our scale
- Additional cost and maintenance
- Data synchronization issues
When to Use Each Method
| Scenario | Method | Reason |
|---|---|---|
| Exact word matches | tsearch only | Fastest |
| Typo tolerance | tsearch + trigram | Best balance |
| Phonetic matching | dmetaphone | Sounds-alike words |
| Millions of rows | ElasticSearch | Better scale |
| Simple substring | LIKE/ILIKE | Simplest |
References
Summary
✅ What we implemented:
- Enabled
pg_trgmextension - Combined tsearch + trigram in
pg_search_scope - Configured reasonable defaults (threshold: 0.2, word_similarity: true)
✅ Benefits:
- Fast exact matches via tsearch
- Typo tolerance via trigram
- Automatic result ranking
- No additional infrastructure
✅ Trade-offs:
- Slightly slower than pure tsearch (but still fast)
- Requires tuning threshold for your use case
- More complex query execution
✅ Next steps:
- Run the migration
- Test with real search queries
- Monitor performance
- Adjust threshold as needed
- Consider adding trigram indexes if queries are slow