Bloom Filters and Bit Vector Filtering Explained - BunksAllowed

BunksAllowed is an effort to facilitate Self Learning process through the provision of quality tutorials.

Community

Bloom Filters and Bit Vector Filtering Explained

Share This

In distributed database systems, reducing the amount of data transferred between nodes is critical for performance. Two powerful techniques used for this purpose are Bit Vector Filtering and Bloom Filters.

Core Idea: Use compact data structures to filter unnecessary data before transferring it across the network.

1. Why Do We Need Filtering Techniques?

When performing joins across distributed sites, transferring entire tables is expensive.

Instead, we:

  • Send a compact representation of data
  • Filter irrelevant rows early
Goal: Minimize communication cost by reducing data transfer.

2. What is Bit Vector Filtering?

A Bit Vector is a simple array of bits (0s and 1s) used to represent whether certain values exist.

How It Works

  1. Create a bit array of fixed size
  2. Hash each key to a position in the array
  3. Set that position to 1

Example

Keys: 10, 25, 30
Bit Vector (size 10):
Index:  0 1 2 3 4 5 6 7 8 9
Value:  0 1 0 0 1 0 1 0 0 0

This vector is sent to another site to filter matching records.


3. Limitations of Bit Vector Filtering

  • Collisions may occur (different values map to same bit)
  • Limited accuracy
Problem: False matches may occur due to hash collisions.

4. What is a Bloom Filter?

A Bloom Filter is an advanced version of bit vector filtering that uses multiple hash functions.

Definition: A probabilistic data structure used to test whether an element is a member of a set.

It can:

  • Confirm if an element is definitely not present
  • Indicate if an element is possibly present

5. How Bloom Filter Works

  1. Create a bit array
  2. Use multiple hash functions
  3. Set multiple bits for each element

Example

Insert key 10:
Hash1 → position 2
Hash2 → position 5
Hash3 → position 7

Set bits at positions 2, 5, 7 to 1

To check a value:

  • If any bit is 0 → element is NOT present
  • If all bits are 1 → element MAY be present

6. Bloom Filter vs Bit Vector

  • Bit Vector → Single hash function
  • Bloom Filter → Multiple hash functions
Advantage: Bloom filters reduce false positives compared to simple bit vectors.

7. Use in Distributed Databases

Bloom filters are widely used in distributed query processing, especially in join operations.

Example

  1. Site 1 creates a Bloom filter for join keys
  2. Sends filter to Site 2
  3. Site 2 filters rows using Bloom filter
  4. Only matching rows are sent back
Result: Significant reduction in data transfer.

8. Advantages of Bloom Filters

  • Compact size
  • Fast membership testing
  • Reduces communication cost

9. Limitations

  • False positives possible
  • No deletion (in basic version)
Note: False positives increase slightly but overall efficiency improves.

10. Real-World Applications

  • Distributed databases
  • Big data systems (Hadoop, Spark)
  • Caching systems
  • Network security

11. Bit Vector vs Bloom Filter (Summary)

Bit Vector:
- Simple
- Less accurate
- Single hash

Bloom Filter:
- Advanced
- More accurate
- Multiple hashes

Conclusion

Bloom filters and bit vector filtering are powerful techniques for reducing communication cost in distributed systems.

By filtering unnecessary data early using compact structures, they significantly improve query performance.

While Bloom filters may introduce small inaccuracies, their efficiency makes them essential in modern distributed databases.



Happy Exploring!

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.