pg_trgm
Similarity search in PostgreSQL using trigrams
The pg_trgm
extension provides trigram matching capabilities for fast text similarity search and fuzzy string matching in PostgreSQL. A trigram is a group of three consecutive characters taken from a string. This extension is particularly useful for implementing features like fuzzy search, spell checking, and finding similar strings.
Your Nile database arrives with the pg_trgm
extension already enabled.
Understanding Trigrams
A trigram is created by taking three consecutive characters from a text string. For example, the word “hello” contains these trigrams: ” h” (with two spaces), “he”, “hel”, “ell”, “llo”, “lo ” (with two spaces). The spaces at the beginning and end are important for matching word boundaries.
Similarity between strings is calculated by counting how many trigrams they share in common. For example, “hello” and “helo” share most of their trigrams (like ” h”, “hel”, “lo ”), resulting in a high similarity score. This makes trigrams excellent for fuzzy matching and finding similar strings even when they contain typos.
Usage Examples
Let’s explore how to use pg_trgm with a practical example using a products database.
Creating a Table with Text Search
Inserting Sample Data
Basic Similarity Queries
Show Trigrams
Setting Similarity Threshold
Word Similarity
Functions and Operators
Main Functions
similarity(text, text)
: Returns similarity between two strings (0 to 1)show_trgm(text)
: Shows trigrams in a stringword_similarity(text, text)
: Returns word-based similaritystrict_word_similarity(text, text)
: Returns strict word-based similarityshow_limit()
: Shows current similarity threshold
Operators
%
: Returns true if strings are similar (uses similarity threshold)<%
: Returns true if first string is less similar than second%>
: Returns true if first string is more similar than second<->
: Returns distance between strings (1 - similarity)<<->
: Returns word-based distance<->>
: Returns strict word-based distance
Index Types
pg_trgm supports two types of indexes:
GiST Index
- Balanced performance between search and update
- Smaller index size
- Good for dynamic data
GIN Index
- Faster searches
- Slower updates
- Larger index size
- Better for static data
Best Practices
-
Index Selection:
- Use GIN for mostly-read data
- Use GiST for frequently updated data
- Consider creating indexes only on frequently searched columns
-
Threshold Tuning:
- Lower threshold (e.g., 0.2) for more matches
- Higher threshold (e.g., 0.5) for stricter matching
- Test with your data to find the optimal value
-
Performance Optimization:
- Use word_similarity() for whole word matching
- Create indexes on specific columns rather than all text columns
- Monitor index size and rebuild when necessary
Common Use Cases
- Fuzzy search functionality
- Spell-check suggestions
- Auto-complete features
- Finding similar product names
- Matching addresses with typos
- Search with tolerance for misspellings
Performance Considerations
- GIN indexes provide faster search but slower updates
- Index size can be large for text columns with many unique values
- Consider partial indexes for large tables
- Monitor and adjust similarity threshold based on false positive/negative rates
Limitations
- Not suitable for very short strings (less than 3 characters)
- May produce false positives
- Index size can be large for big text columns
- Not ideal for exact matching (use standard indexes instead)
Additional Resources
Was this page helpful?