pg_similarity
Text similarity measures in PostgreSQL
The pg_similarity
extension provides a collection of similarity measures for comparing text strings in PostgreSQL databases.
It includes a comprehensive collection of search algorithms, which you can find listed toward the end of this document.
Your Nile database arrives with this extension already enabled.
Usage Examples
Lets show how to use the pg_similarity
extension to find similar products in our database.
Creating a Table for Text Comparison
Inserting Sample Data
Finding Similar Names
Finding Similar Content
Configuration
Each similarity measure has two or three configuration options:
threshold
: The threshold for the similarity measure.is_normalized
: Whether the similarity measure is normalized (between 0 and 1) or not.tokenizer
: The tokenizer to use for the similarity measure ( Default is alnum, and other options are gram, word, and camelcase). Note that not every algorithm supports the tokenizer option.
To use a specific configuration, you can use SET pg_similarity.<algorithm_name>_<config_name> = value
.
For example, to use the cosine
similarity measure with a threshold of 0.3, you can use:
To reset the threshold to the default value, you can use:
Common Use Cases
- Finding duplicate records with slight variations
- Implementing spell checking and typo-tolerant search
- Matching similar names or addresses
- Finding similar content
- Phonetic matching (e.g., “Smith” vs “Smyth”)
- Record linkage across databases
Complete Algorithm Reference
The extension includes the following similarity algorithms:
Algorithm | Function | Operator | ||
---|---|---|---|---|
Block | block(text, text) | ~++ | ||
Cosine | cosine(text, text) | ~## | ||
Dice | dice(text, text) | ~-~ | ||
Euclidean | euclidean(text, text) | ~!! | ||
Hamming | hamming(text, text) | |||
Jaccard | jaccard(text, text) | ~?? | ||
Jaro | jaro(text, text) | ~%% | ||
Jaro-Winkler | jarowinkler(text, text) | ~@@ | ||
Levenshtein | lev(text, text) | ~== | ||
Matching | matching(text, text) | ~^^ | ||
Monge-Elkan | mongeelkan(text, text) | ~ | ||
Needleman-Wunsch | needlemanwunch(text, text) | |||
Overlap | overlap(text, text) | ~** | ||
Q-Gram | qgram(text, text) | ~~~ | ||
Smith-Waterman | smithwaterman(text, text) | |||
Smith-Waterman-Gotoh | smithwatermangotoh(text, text) | |||
Soundex | soundex(text, text) | ~*~ |
Each algorithm is best suited for a different use case.
For example, jaro
is better for name matching, while levenshtein
is better for typo detection.
Limitations
- Some algorithms may be computationally expensive
- Not all measures are suitable for all languages
- Memory usage can be high for large strings
- Some algorithms may not work well with very short strings