2024-11-05-pgvector-080/cover.png
2024-11-05
6 min read
gwenshap profile pic
Gwen Shapira

Announcing: pgvector 0.8.0 released and available on Nile

The pgvector community shipped the much anticipated version 0.8.0 with significant performance and functionality improvements. Naturally, we couldn't wait to make it available to Nile users.

Release highlights

Per the official release notes, pgvector 0.8.0 includes:

  • Added support for iterative index scans
  • Added casts for arrays to sparsevec
  • Improved cost estimation for better index selection when filtering
  • Improved performance of HNSW index scans
  • Improved performance of HNSW inserts and on-disk index builds
  • Dropped support for Postgres 12

The most anticipated feature is iterative index scanning, which addresses a longstanding challenge with vector indexes. Previously, filters were applied after the index scan completed, which often yielded fewer results than expected. According to the pgvector documentation:

With approximate indexes, filtering is applied after the index is scanned. If a condition matches 10% of rows, with HNSW and the default hnsw.ef_search of 40, only 4 rows will match on average.

Common workarounds include scanning more rows, using partial indexes, or partitioning, but these methods can be impractical or undesirable. The new iterative scan feature offers a more straightforward and intuitive solution:

  1. Scan the vector index
  2. Apply filter
  3. Check if enough results are returned. If not, repeat the scan.

Lets see this in action with a very small example. I strongly recommend running small experiments - you learn so much if the actual results don't match your expectations. Follow along for important pgvector lessons:

To start, I create a table with sample data:

CREATE TABLE filtest(id INTEGER, category INTEGER, embedding vector(3));
INSERT INTO filtest VALUES
    (1, 1, '[3, 1, -2]'),
    (2, 1, '[3, 1, -2]'),
    (3, 1, '[3, 1, -2]'),
    (1, 2, '[1.1, 2.2, 3.3]'),
    (2, 2, '[1.1, 2.2, 3.3]'),
    (3, 2, '[1.1, 2.2, 3.3]');
CREATE INDEX ON filtest USING hnsw (embedding vector_cosine_ops);

The table contains six rows divided into two categories. Vectors in category 2 closely resemble [1, 2, 3], while vectors in category 1 differ significantly, representing orthogonal (unrelated) data. In real use-cases, categories can be different companies (tenants), different departments within the same company, genres if the table stores information about movies, etc. Anything that you may want to use for filtering.

What would you expect the following query to return?

SELECT id, category, embedding<=>'[1,2,3]'  AS distance
FROM filtest WHERE category=1
ORDER BY distance LIMIT 3;

Index scans in pgvector 0.7.4

I searched for 3 vectors, from category 1, that are closest to [1,2,3], so the correct answer is to return all vectors from category 1. However, this is what I expected pgvector 0.7.4 to do:

  1. Scan the vector index and find the 3 vectors closest to [1,2,3], some of which should belong to category 2.
  2. Filter the results and keep only vectors from category 1.
  3. Return partial results.

However, in practice:

SELECT extversion FROM pg_extension WHERE extname = 'vector';
 extversion
------------
 0.7.4
(1 row)

SELECT id, category, embedding<=>'[1,2,3]'  AS distance
FROM filtest WHERE category=1 ORDER BY distance LIMIT 3;
 id | category |      distance
----+----------+--------------------
  3 |        1 | 1.0714285714285714
  1 |        1 | 1.0714285714285714
  2 |        1 | 1.0714285714285714

Why are we seeing the correct results? Because pgvector's default configuration for the number of rows to scan in the vector index is 40. So the index scan returned the entire table, and after filtering, we got the right result. Thats the problem with small examples... if I tried with 100 vectors, it would have missed some results as expected.

Lets try with a tweaked configuration:

SET hnsw.ef_search = 3;
SET

SELECT id, category, embedding<=>'[1,2,3]'  AS distance
FROM filtest WHERE category=1 ORDER BY distance LIMIT 3;
 id | category |      distance
----+----------+--------------------
  3 |        1 | 1.0714285714285714
  2 |        1 | 1.0714285714285714
(2 rows)

This is the partial result that I expected!

Iterative index scans in pgvector 0.8.0

Now lets see how the same query behaves in pgvector 0.8.0:

SELECT extversion FROM pg_extension WHERE extname = 'vector';
 extversion
------------
 0.8.0

SET hnsw.ef_search = 3;
SET

SELECT id, category, embedding<=>'[1,2,3]'  AS distance
FROM filtest WHERE category=1 ORDER BY distance LIMIT 3;
 id | category |      distance
----+----------+--------------------
  1 |        1 | 1.0714285714285714
  2 |        1 | 1.0714285714285714
  3 |        1 | 1.0714285714285714
(3 rows)

Great! pgvector 0.8.0 delivers! Yes, but not in the way you think:

SHOW hnsw.iterative_scan;
 hnsw.iterative_scan
---------------------
 off
(1 row)

Oops! hnsw.iterative_scan is disabled. So how did it still work? Use of explain shows that the index wasn't used:

EXPLAIN SELECT id, category, embedding<=>'[1,2,3]' AS distance
FROM filtest WHERE category=1 ORDER BY distance LIMIT 3;
                             QUERY PLAN
---------------------------------------------------------------------
 Limit  (cost=25.09..25.10 rows=3 width=16)
   ->  Sort  (cost=25.09..25.11 rows=6 width=16)
         Sort Key: ((embedding <=> '[1,2,3]'::vector))
         ->  Seq Scan on filtest  (cost=0.00..25.02 rows=6 width=16)
               Filter: (category = 1)
(5 rows)

This is due to a different improvement in pgvector 0.8.0:

Improved cost estimation for better index selection when filtering

It is really silly to use a vector index when scanning a 6-row table. So with the improved cost estimation, it no longer happens. This is good news, but I really want to check the iterative scan. So lets force the use of index and try again:

SET enable_seqscan=false; -- force use of index, don't do this in production!
SET

SELECT id, category, embedding<=>'[1,2,3]'  AS distance
FROM filtest WHERE category=1 ORDER BY distance LIMIT 3;
 id | category |      distance
----+----------+--------------------
  2 |        1 | 1.0714285714285714
  3 |        1 | 1.0714285714285714
(2 rows)

SET hnsw.iterative_scan = relaxed_order; -- enable iterative scan
SET

SELECT id, category, embedding<=>'[1,2,3]'  AS distance
FROM filtest WHERE category=1 ORDER BY distance LIMIT 3;
 id | category |      distance
----+----------+--------------------
  2 |        1 | 1.0714285714285714
  3 |        1 | 1.0714285714285714
  1 |        1 | 1.0714285714285714
(3 rows)

And we can see that pgvector 0.8.0 and iterative scans work as expected. As a bonus, we also saw the cost optimization in action, and learned about hnsw.ef_search query configuration. One last tip: In production use, you will likely want an index on the columns you use for filtering, and not just the vector index.

If you are a Nile user, you already have pgvector 0.8.0, so I recommend taking advantage of it. Head over to our documentation for detailed explanation of iterative_scan options, and fancy optimizations.