Full Text Search with Typo Tolerance

Overview

This document explains how we implemented fuzzy search (typo tolerance) in the listings_jobs table using PostgreSQL's trigram similarity feature combined with traditional full text search.

The Problem

Standard PostgreSQL full text search (tsearch) does not handle typos. If a user searches for "casier" instead of "cashier", tsearch returns no results because it works with exact word stems and tokens.

Example:

Search: "casier"
Expected: Jobs with "cashier" in title/description
Actual (with tsearch only): No results ❌

How Full Text Search Works (First Principles)

1. Traditional Text Search (LIKE operator)

The simplest approach to text search uses SQL's LIKE operator:

SELECT * FROM listings_jobs WHERE title LIKE '%cashier%';

Problems:

Very slow on large tables (requires full table scan)
No relevance ranking
No typo tolerance
Case-sensitive unless you use ILIKE

2. Full Text Search (tsearch)

PostgreSQL's full text search tokenizes and stems words, then stores them in a tsvector column:

-- Original text: "Looking for experienced cashier"
-- Tokenized and stemmed: 'cashier':4 'experienc':3 'look':1

How it works:

Text is broken into tokens (words)
Tokens are stemmed (e.g., "running" → "run", "cashiers" → "cashier")
Tokens are stored in a tsvector column with positions
A GIN index is created on the tsvector for fast lookup
Searches use @@ operator to match against the tsvector

Benefits:

⚡ Very fast (uses GIN index)
📊 Relevance ranking built-in
🔤 Language-aware stemming (handles plurals, tenses)
🔍 Prefix matching

Limitations:

❌ No typo tolerance
❌ Exact stem matches only

3. Trigram Similarity (pg_trgm)

Trigrams break text into overlapping 3-character sequences to measure similarity:

"cashier" → {" ca", "cas", "ash", "shi", "hie", "ier", "er "}
"casier"  → {" ca", "cas", "asi", "sie", "ier", "er "}

How similarity is calculated:

Similarity = (common trigrams) / (total unique trigrams)

Common trigrams between "cashier" and "casier":
{" ca", "cas", "ier", "er "} = 4 trigrams

Total unique trigrams: 9

Similarity = 4/9 ≈ 0.44

Benefits:

✅ Handles typos and misspellings
✅ Works with fuzzy matches
✅ Phonetic similarity
✅ Works on any text column

Limitations:

🐌 Slower than tsearch (but still reasonably fast with GIN indexes)
📊 Different ranking algorithm

Our Implementation Strategy

We combine both tsearch and trigram to get the best of both worlds:

pg_search_scope :full_text_search,
  against: [
    :company_name,
    :category_name,
    :title,
    :description,
    :description_summary,
    :requirements_raw,
    :benefits_raw
  ],
  using: {
    tsearch: {
      dictionary:      'english',
      tsvector_column: :search_vector,
      prefix:          true
    },
    trigram: {
      threshold:       0.2,
      word_similarity: true
    }
  }

How It Works Together

When you call Listings::Job.full_text_search("casier"):

tsearch looks for exact stem matches in the search_vector column
- Fast lookup via GIN index
- Returns results with exact matches (if any)
trigram calculates similarity against all specified columns
- Finds "cashier" because similarity to "casier" is 0.44 (above 0.2 threshold)
- Returns fuzzy matches
pg_search combines and ranks both result sets
- Exact matches get higher scores
- Fuzzy matches get lower scores (but still included)
- Results are sorted by relevance

Configuration Parameters Explained

`threshold: 0.2`

This controls how similar strings need to be to match:

0.1 = very lenient (matches almost anything)
0.2 = balanced (recommended starting point)
0.3 = moderate
0.5 = strict

Example comparisons:

"cashier" vs "casier" = 0.44 ✅ (matches at 0.2 threshold)
"cashier" vs "waiter" = 0.09 ❌ (doesn't match at 0.2 threshold)

`word_similarity: true`

This uses PostgreSQL's word_similarity() function instead of similarity():

similarity(): Compares entire strings
- "full time cashier" vs "cashier" = lower score
word_similarity(): Finds best matching word
- "full time cashier" vs "cashier" = higher score (matches the word "cashier")

Recommendation: Use word_similarity: true for multi-word searches.

`against: [:column_names]`

Specifies which columns trigram should search. This is required because trigram works on actual column text, not on the pre-computed search_vector.

Database Schema Changes

Migration Created

class EnableExtensionPgTrgm < ActiveRecord::Migration[8.0]
  def change
    enable_extension "pg_trgm" unless extension_enabled?("pg_trgm")
  end
end

This enables the PostgreSQL trigram extension, which provides:

similarity() function
word_similarity() function
% operator for similarity matching
GIN/GIST index support for trigrams

Existing Index

We already have a GIN index on the search_vector column for fast tsearch:

t.index ["search_vector"], name: "index_listings_jobs_on_search_vector", using: :gin

Note: You can optionally add trigram indexes for better performance:

add_index :listings_jobs, :title, using: :gin, opclass: :gin_trgm_ops

However, this is usually not necessary unless you have millions of rows.

Testing the Implementation

1. Run the Migration

bundle exec rails db:migrate

2. Test in Rails Console

# Search with exact match (tsearch handles this)
Listings::Job.full_text_search("cashier").count

# Search with typo (trigram handles this)
Listings::Job.full_text_search("casier").count

# Should return same results (or very similar)

3. Test Different Typo Severities

# Mild typo (1 character different)
Listings::Job.full_text_search("casier")

# Moderate typo (2 characters different)
Listings::Job.full_text_search("casher")

# Severe typo (might not match depending on threshold)
Listings::Job.full_text_search("cascr")

4. Check Similarity Scores

require 'pg_search'

# Calculate similarity manually in PostgreSQL
ActiveRecord::Base.connection.execute(
  "SELECT similarity('cashier', 'casier') as score"
).first
# => {"score"=>"0.444444"}

ActiveRecord::Base.connection.execute(
  "SELECT similarity('cashier', 'cascr') as score"
).first
# => {"score"=>"0.222222"}

Performance Considerations

Speed Comparison

Method	Speed	Use Case
tsearch	🚀 Fastest	Exact matches, large datasets
trigram	🏃 Fast	Fuzzy matches, typo tolerance
LIKE	🐌 Slow	Small datasets, simple queries

Optimization Tips

Keep both tsearch and trigram for best results
- tsearch handles 90% of queries (exact matches)
- trigram catches the remaining 10% (typos)
Adjust threshold based on your needs
- Start with 0.2
- Monitor false positives (irrelevant results)
- Increase threshold if too many false positives
- Decrease threshold if missing relevant results
Use selective against columns
- Only include columns users actually search
- More columns = slower trigram search
- Title and description are usually sufficient

Monitor query performance

# Check slow queries
Listings::Job.full_text_search("casier").explain

Tuning the Threshold

Start with threshold: 0.2 and adjust based on testing:

Too Many False Positives?

Increase threshold to 0.3 or 0.4:

trigram: {
  threshold: 0.3,  # More strict
  word_similarity: true
}

Missing Relevant Results?

Decrease threshold to 0.15:

trigram: {
  threshold: 0.15,  # More lenient
  word_similarity: true
}

Test Different Thresholds

# Create a helper method to test
def test_threshold(query, threshold)
  Listings::Job.full_text_search(query).reorder(nil).select do |job|
    similarity = ActiveRecord::Base.connection.execute(
      "SELECT similarity('#{query}', '#{job.title}')"
    ).first['similarity'].to_f

    similarity >= threshold
  end
end

# Test different thresholds
test_threshold("casier", 0.2).count  # More results
test_threshold("casier", 0.3).count  # Fewer results
test_threshold("casier", 0.4).count  # Even fewer

Common Issues and Solutions

Issue: Slow Queries

Solution: Add trigram indexes on frequently searched columns:

class AddTrigramIndexes < ActiveRecord::Migration[8.0]
  def change
    add_index :listings_jobs, :title,
      using: :gin,
      opclass: :gin_trgm_ops,
      name: 'index_listings_jobs_on_title_trgm'
  end
end

Issue: Too Many False Positives

Solution: Increase the threshold or use word_similarity instead of regular similarity:

trigram: {
  threshold: 0.3,
  word_similarity: true  # Better for multi-word fields
}

Issue: Not Finding Expected Results

Solutions:

Check if the text exists in the against columns
Lower the threshold
Verify the extension is enabled: SELECT * FROM pg_extension WHERE extname = 'pg_trgm';
Check the similarity score manually

Alternative Approaches Considered

1. Levenshtein Distance (fuzzystrmatch)

enable_extension 'fuzzystrmatch'

# Usage
SELECT * FROM listings_jobs
WHERE levenshtein(title, 'casier') < 3;

Why we didn't use it:

Slower than trigrams
Doesn't work well with multi-word fields
No built-in GIN index support

2. Double Metaphone (Phonetic)

using: {
  dmetaphone: {}
}

Why we didn't use it:

Only useful for phonetic similarity ("Stephen" vs "Steven")
Doesn't handle typos well
English-language specific

3. ElasticSearch / OpenSearch

Why we didn't use it:

Adds infrastructure complexity
PostgreSQL trigrams are sufficient for our scale
Additional cost and maintenance
Data synchronization issues

When to Use Each Method

Scenario	Method	Reason
Exact word matches	tsearch only	Fastest
Typo tolerance	tsearch + trigram	Best balance
Phonetic matching	dmetaphone	Sounds-alike words
Millions of rows	ElasticSearch	Better scale
Simple substring	LIKE/ILIKE	Simplest

References

Summary

✅ What we implemented:

Enabled pg_trgm extension
Combined tsearch + trigram in pg_search_scope
Configured reasonable defaults (threshold: 0.2, word_similarity: true)

✅ Benefits:

Fast exact matches via tsearch
Typo tolerance via trigram
Automatic result ranking
No additional infrastructure

✅ Trade-offs:

Slightly slower than pure tsearch (but still fast)
Requires tuning threshold for your use case
More complex query execution

✅ Next steps:

Run the migration
Test with real search queries
Monitor performance
Adjust threshold as needed
Consider adding trigram indexes if queries are slow

Overview​

The Problem​

How Full Text Search Works (First Principles)​

1. Traditional Text Search (LIKE operator)​

2. Full Text Search (tsearch)​

3. Trigram Similarity (pg_trgm)​

Our Implementation Strategy​

How It Works Together​

Configuration Parameters Explained​

threshold: 0.2​

word_similarity: true​

against: [:column_names]​

Database Schema Changes​

Migration Created​

Existing Index​

Testing the Implementation​

1. Run the Migration​

2. Test in Rails Console​

3. Test Different Typo Severities​

4. Check Similarity Scores​

Performance Considerations​

Speed Comparison​

Optimization Tips​

Tuning the Threshold​

Too Many False Positives?​

Missing Relevant Results?​

Test Different Thresholds​

Common Issues and Solutions​

Issue: Slow Queries​

Issue: Too Many False Positives​

Issue: Not Finding Expected Results​

Alternative Approaches Considered​

1. Levenshtein Distance (fuzzystrmatch)​

2. Double Metaphone (Phonetic)​

3. ElasticSearch / OpenSearch​

When to Use Each Method​

References​

Summary​

Overview

The Problem

How Full Text Search Works (First Principles)

1. Traditional Text Search (LIKE operator)

2. Full Text Search (tsearch)

3. Trigram Similarity (pg_trgm)

Our Implementation Strategy

How It Works Together

Configuration Parameters Explained

`threshold: 0.2`

`word_similarity: true`

`against: [:column_names]`

Database Schema Changes

Migration Created

Existing Index

Testing the Implementation

1. Run the Migration

2. Test in Rails Console

3. Test Different Typo Severities

4. Check Similarity Scores

Performance Considerations

Speed Comparison

Optimization Tips

Tuning the Threshold

Too Many False Positives?

Missing Relevant Results?

Test Different Thresholds

Common Issues and Solutions

Issue: Slow Queries

Issue: Too Many False Positives

Issue: Not Finding Expected Results

Alternative Approaches Considered

1. Levenshtein Distance (fuzzystrmatch)

2. Double Metaphone (Phonetic)

3. ElasticSearch / OpenSearch

When to Use Each Method

References

Summary