An efficient way of calculating distinct count | HyperLogLog
The time complexity of calculating distinct count is of order O(N) which is pretty huge if you have the requirement to have real-time analysis on these metrics.
Let’s take an example, Facebook wants to know active users/per minute on the platform. Now, people can log in multiple times with different devices. When you are operating at 2.5 billion users you can’t just fire a distinct count query in the SQL and expect it to get completed in real-time. But if you analyze the situation we are concerned about the approximate number of the distinct user not interested in the precise value of it.
What is Hyperloglog?
A HyperLogLog is a probabilistic data structure used to count unique values — or as it’s referred to in mathematics: calculating the cardinality of a set.
In simple words, Instead of calculating the actual number of unique values, it estimates the count based on the maximum number of trailing zero in the binary representation of each number in the set. Not simple enough?
Let’s say we have 4 integers [2,5,8,3]
2 => 10 => number of trailing zero is 1
5 => 101 => number of trailing zero is 0
8 => 1000 => number of trailing zero is 3
3 => 11 => number of trailing zero is 0
so maximum trailing zero is 3.
And if we consider that the data is uniformly distributed(by using a good hash function) then how many distinct numbers should be there to get 3 trailing zero?
let’s calculate how many distinct numbers should be there to get 1 trailing zero? In binary representation, the last bit can be 0/1 right? so we should have 2 number probabilistically in the set to get 1 trailing zero. Similarly if the maximum trailing zero of set is 3 that means the cardinality of that should be around 8 (2³).
You must be wondering that the error rate of this algo is huge. For 4 integer it is saying that there are 8 unique integers in the set. The above explanation is just the base of the algorithm. In reality, instead of calculating the maximum over the whole set, numbers are divided uniformly into buckets, and the maximum of each bucket is used to calculate the cardinality of a set(Harmonic mean of all the values). Refer to this for more details.
The error is less than 1% standard error that means for a set with 1000,000 distinct users it should predict in the range [998,999 to 1010,000] which is good enough for our example.
Why the name is loglog?
If you would have observed the space complexity of this algo is loglog(N) as we are just storing the maximum number of trailing zero of the set.
Redis has HLL as one of the built-in data structures. Some stats on the of HLL in the Redis:
Accuracy: approximated with a standard error of 0.81%.
Time complexity: O(1)
Redis HLL is 1000x faster than the SQL distinct query and gives a result close to the real value.