The pg_similarity extension provides a collection of similarity measures for comparing text strings in PostgreSQL databases. It includes a comprehensive set of similarity algorithms, listed in the Complete Algorithm Reference toward the end of this document. Your Nile database comes with this extension already enabled.
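Because the extension ships enabled, no CREATE EXTENSION step is needed. If you want to confirm it is present, you can query the system catalog:

SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_similarity';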

Usage Examples

Let's walk through how to use the pg_similarity extension to find similar products in our database.

Creating a Table for Text Comparison

CREATE TABLE products (
  tenant_id uuid,
  id integer,
  name text,
  description text,
  PRIMARY KEY (tenant_id, id)
);

Inserting Sample Data

-- Create a tenant first
INSERT INTO tenants (id, name) VALUES
  ('d1c06023-3421-4fbb-9dd1-c96e42d2fd02', 'Tenant 1');

INSERT INTO products (tenant_id, id, name, description) VALUES
  ('d1c06023-3421-4fbb-9dd1-c96e42d2fd02', 1, 'Laptop Computer', 'High-performance laptop'),
  ('d1c06023-3421-4fbb-9dd1-c96e42d2fd02', 2, 'Laptop Computr', 'High-performance notebook'),
  ('d1c06023-3421-4fbb-9dd1-c96e42d2fd02', 3, 'Desktop Computer', 'Powerful desktop workstation');

Finding Similar Names

-- Using Levenshtein for typo detection
SELECT name, 
       lev(name, 'Laptop Computer') as similarity
FROM products
ORDER BY similarity DESC;

-- Using operator syntax
SELECT name
FROM products
WHERE name ~== 'Laptop Computer';

-- Using threshold to show only more similar results
SET pg_similarity.levenshtein_threshold = 0.8;

SELECT name
FROM products
WHERE name ~== 'Laptop Computer';

Finding Similar Content

-- Using Cosine for content matching
SELECT name, 
       cosine(description, 'high performance notebook') as similarity
FROM products
WHERE cosine(description, 'high performance notebook') > 0.5
ORDER BY similarity DESC;

-- Using operator syntax
-- This will return nothing because the default threshold is 0.7
SELECT name
FROM products
WHERE description ~## 'high performance notebook';

-- Set threshold for specific algorithm
SET pg_similarity.cosine_threshold = 0.3;

-- Now this will return the results
SELECT name
FROM products
WHERE description ~## 'high performance notebook';

Configuration

Each similarity measure has two or three configuration options:

  • threshold: The minimum similarity value at which the algorithm's operator considers two strings a match.
  • is_normalized: Whether the similarity measure returns values normalized to the range 0 to 1.
  • tokenizer: The tokenizer used to split strings before comparison (default is alnum; other options are gram, word, and camelcase). Note that not every algorithm supports the tokenizer option.

To change a setting, use SET pg_similarity.<algorithm_name>_<config_name> = value.

For example, to use the cosine similarity measure with a threshold of 0.3, you can use:

SET pg_similarity.cosine_threshold = 0.3;

To reset the threshold to the default value, you can use:

RESET pg_similarity.cosine_threshold;
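The tokenizer option works the same way, for the algorithms that support it. For example, to have the cosine measure compare character n-grams instead of alphanumeric tokens, you could use:

-- Switch the cosine tokenizer to n-grams, then restore the default (alnum)
SET pg_similarity.cosine_tokenizer = 'gram';

RESET pg_similarity.cosine_tokenizer;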

Common Use Cases

  • Finding duplicate records with slight variations
  • Implementing spell checking and typo-tolerant search
  • Matching similar names or addresses
  • Finding similar content
  • Phonetic matching (e.g., “Smith” vs “Smyth”)
  • Record linkage across databases
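As a sketch of the first use case, a self-join with jarowinkler can surface near-duplicate names within a tenant. The 0.9 cutoff here is an illustrative choice, not a recommended value:

-- Pair up products in the same tenant whose names are nearly identical
SELECT a.id, a.name, b.id AS other_id, b.name AS other_name,
       jarowinkler(a.name, b.name) AS similarity
FROM products a
JOIN products b
  ON a.tenant_id = b.tenant_id
 AND a.id < b.id          -- avoid comparing a row to itself or listing pairs twice
WHERE jarowinkler(a.name, b.name) > 0.9;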

Complete Algorithm Reference

The extension includes the following similarity algorithms:

Algorithm               Function                            Operator
Block                   block(text, text)                   ~++
Cosine                  cosine(text, text)                  ~##
Dice                    dice(text, text)                    ~-~
Euclidean               euclidean(text, text)               ~!!
Hamming                 hamming(text, text)                 ~@~
Jaccard                 jaccard(text, text)                 ~??
Jaro                    jaro(text, text)                    ~%%
Jaro-Winkler            jarowinkler(text, text)             ~@@
Levenshtein             lev(text, text)                     ~==
Matching                matchingcoefficient(text, text)     ~^^
Monge-Elkan             mongeelkan(text, text)              ~||
Needleman-Wunsch        needlemanwunsch(text, text)         ~#~
Overlap                 overlapcoefficient(text, text)      ~**
Q-Gram                  qgram(text, text)                   ~~~
Smith-Waterman          smithwaterman(text, text)           ~=~
Smith-Waterman-Gotoh    smithwatermangotoh(text, text)      ~!~
Soundex                 soundex(text, text)                 ~*~

Each algorithm is best suited for a different use case. For example, jaro is better for name matching, while levenshtein is better for typo detection.
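To see the difference in practice, you can score the same typo with both functions side by side (the exact values depend on each measure's normalization, so none are shown here):

SELECT lev('Laptop Computer', 'Laptop Computr')  AS lev_score,
       jaro('Laptop Computer', 'Laptop Computr') AS jaro_score;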

Limitations

  • Some algorithms may be computationally expensive
  • Not all measures are suitable for all languages
  • Memory usage can be high for large strings
  • Some algorithms may not work well with very short strings
