In distributed database systems, reducing the amount of data transferred between nodes is critical for performance. Two powerful techniques used for this purpose are Bit Vector Filtering and Bloom Filters.
1. Why Do We Need Filtering Techniques?
When performing joins across distributed sites, transferring entire tables is expensive.
Instead, we:
- Send a compact representation of data
- Filter irrelevant rows early
2. What is Bit Vector Filtering?
A Bit Vector is a simple array of bits (0s and 1s) used to represent whether certain values exist.
How It Works
- Create a bit array of fixed size
- Hash each key to a position in the array
- Set that position to 1
Example
Keys: 10, 25, 30 Bit Vector (size 10): Index: 0 1 2 3 4 5 6 7 8 9 Value: 0 1 0 0 1 0 1 0 0 0
This vector is sent to another site to filter matching records.
3. Limitations of Bit Vector Filtering
- Collisions may occur (different values map to same bit)
- Limited accuracy
4. What is a Bloom Filter?
A Bloom Filter is an advanced version of bit vector filtering that uses multiple hash functions.
It can:
- Confirm if an element is definitely not present
- Indicate if an element is possibly present
5. How Bloom Filter Works
- Create a bit array
- Use multiple hash functions
- Set multiple bits for each element
Example
Insert key 10: Hash1 → position 2 Hash2 → position 5 Hash3 → position 7 Set bits at positions 2, 5, 7 to 1
To check a value:
- If any bit is 0 → element is NOT present
- If all bits are 1 → element MAY be present
6. Bloom Filter vs Bit Vector
- Bit Vector → Single hash function
- Bloom Filter → Multiple hash functions
7. Use in Distributed Databases
Bloom filters are widely used in distributed query processing, especially in join operations.
Example
- Site 1 creates a Bloom filter for join keys
- Sends filter to Site 2
- Site 2 filters rows using Bloom filter
- Only matching rows are sent back
8. Advantages of Bloom Filters
- Compact size
- Fast membership testing
- Reduces communication cost
9. Limitations
- False positives possible
- No deletion (in basic version)
10. Real-World Applications
- Distributed databases
- Big data systems (Hadoop, Spark)
- Caching systems
- Network security
11. Bit Vector vs Bloom Filter (Summary)
Bit Vector: - Simple - Less accurate - Single hash Bloom Filter: - Advanced - More accurate - Multiple hashes
Conclusion
Bloom filters and bit vector filtering are powerful techniques for reducing communication cost in distributed systems.
By filtering unnecessary data early using compact structures, they significantly improve query performance.
While Bloom filters may introduce small inaccuracies, their efficiency makes them essential in modern distributed databases.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.