HLL
Probabilistic data structure that estimates the number of unique elements in a set.
The hll extension in PostgreSQL provides a HyperLogLog data structure for approximate distinct counting. It is highly efficient for estimating the number of unique elements in large datasets while using minimal memory.
Your Nile database arrives with hll
extension already enabled.
Let’s create a sample analytics system that efficiently tracks unique user interactions. We’ll use a fact table to store raw events and leverage HLL (HyperLogLog) for efficient approximate distinct counting in our aggregated statistics. This pattern is common in analytics systems where you need to track unique users across different time periods while maintaining reasonable storage and query performance.
Creating and Populating events
Table
Let’s create a sample table to store unique user interactions, using HLL for approximate distinct counts:
Understanding HLL Hashing
Before values can be added to an HLL data structure, they must first be hashed. The hll
extension provides several hashing functions for different data types:
hll_hash_integer(value)
- for integer valueshll_hash_text(value)
- for text valueshll_hash_bytea(value)
- for binary datahll_hash_any(value)
- for other data types
These hash functions convert the input values into consistent hash values that the HLL algorithm can process. Hashing ensures that:
- Values are uniformly distributed
- The same input always produces the same hash
- Different inputs are likely to produce different hashes
Approximate Counting with HLL - Estimating the Number of Unique Users
Here’s how to estimate the number of unique users for each event type across all dates.
Note that you can’t simply add up the HLL values to get the total distinct count, as this would double-count users.
Instead, you need to use the hll_union_agg
function:
Use Cases
- Efficient distinct counting for large-scale analytics.
- Web traffic analysis (e.g., unique visitors per day).
- Approximate user engagement tracking.
- Optimized analytics dashboards that require fast estimations.
Limitations
- HyperLogLog provides an approximate distinct count, not an exact value.
- The accuracy of estimates depends on the HLL precision settings.
- Cannot retrieve individual elements once they are added to an HLL aggregate.
Conclusion
The hll
extension in PostgreSQL is an efficient solution for large-scale distinct counting, offering fast and memory-efficient approximations. It is particularly useful for analytics and tracking unique values over time.
For more details, refer to the hll
GitHub repository.