Bloom Filters

Basically i got to know about this from hinge, yes i got everything from hinge except date

lol

I was swiping left and right, then i thought what would the under the hood algo works like, profiles and metadata must be stored in some database, but which profile to show, and which to avoid, in that also there is fix size of profiles for me, how do they avoid showing me same profile multiple times, and then we went to rabbit hole, and here are it’s learnings ;)

found some really great use case example from the internet, which helped me to understand this thing: From Wikipedia:

Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). Linked structures incur an additional linear space overhead for pointers. A Bloom filter with 1% error and an optimal value of k, on the other hand, requires only about 9.6 bits per element — regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times.

https://stackoverflow.com/questions/4282375/what-is-the-advantage-to-using-bloom-filters

A bloom filter doesn’t store the elements themselves, this is the crucial point. You don’t use a bloom filter to test if an element is present, you use it to test whether it’s certainly not present, since it guarantees no false negatives. This lets you not do extra work for elements that don’t exist in a set (such as disk IO to look them up).

It is probabilistic data structure that can*:

add element to a set
check if an element is in the set by telling definitely not in the set or possibly in the set

But it can’t *:

remove an item from the set
give you a list of all elements that are currently in your set

false positive rate:

my first thought was who names things like this, false positive, but after you understand it makes sense,

there can be cases where it falsely thinks that the element is positive but false negative are impossible.

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set or not. It is called probabilistic because while a negative answer is 100% accurate, there might be false positives means that if the answer is “Yes, this element presents in the data set` it is not 100% accurate.

how it works:

you initiate an empty bit array of length m
you select k different hash functions (the more independent the better)
if you want to add element, you calculate all the k hashes of this value and set the corresponding bits to 1
if you want to check if element exist, you also calculate all k hashes and if at least one of them is not set, it is surely is not in the set. Otherwise it can be in the set.

Even this description is enough to understand why we can’t be sure (you can get all bits set from various other values). Here is a very nice visualization of how it works.

](https://i.sstatic.net/avlAi.png)

So when can bloom filters be useful? The short answer is everywhere where false positive are acceptable and where you would want to check if something is in the set, but even if they are not, it can be a first line of defense to rule out expensive calls to verifiers.

Here is a list of more concrete descriptions:

a standard example of malicious websites and a browser is described in almost any place where people talk about bloom filters
is a passwords weak: instead of having a huge set of all possible weak passwords, you can just check whether the password is surely not weak with a way smaller bloom filter
if you have a list of articles and a list of users, you can use bloom filter to show users’ articles they have not read. Interesting thing is that you can have only one filter (you check whether the combination of user_id + article_id is there)
bitcoin uses bloom filter for wallet synchronization
Akamai’s web servers use Bloom filters to prevent “one-hit-wonders” from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates (taken from examples in bloom’s filter article at wiki)

How to select hash function?

In Bloom filters, hash functions should ideally have two main properties:

Uniform Distribution: The hash function should distribute the output (hash values) uniformly across the bit array. This ensures that each bit in the array is equally likely to be set, minimizing the chances of collisions and thereby reducing false positives.
Independence: Multiple hash functions used in a Bloom filter should be independent of each other. This means that the outcome of one hash function should not influence the outcome of another.

cool site: https://hur.st/bloomfilter/

need to use or work with this to understand it better

https://www.youtube.com/watch?v=V3pzxngeLqw https://www.youtube.com/watch?app=desktop&v=bgzUdBVr5tE

Karan's Streams

Explorer

Bloom Filters

false positive rate:

how it works:

How to select hash function?

Graph View

Backlinks