Week 7 - Homework 5 (unassessed)

Week 8 - Bloom Filters

This week we will be focusing on hash functions. First we will look at a probability question.

Balls in bins problem

If you randomly throw identical balls in identical bins for where each throw is independent of on another. Define number of balls in bin . How large is the max load = ?

We are going to show that …

So first look at

by Stirling’s formula and the definition of the Binomial coefficient.

So by letting we can further say Using this as there are bins we can bound the max load Infact further analysis can show that with high probability max load = .

Best of 2 approach

Best of 2(n)
	Input: positive integer n
	Output: allocation of n balls to n bins
1. For i = 1 -> n
	2. Randomly select two bins j & k.
	3. If load(j) < load(k) assign ball to bin j
	4. Otherwise assign ball to bin k

In this approach we get the max load is with high probability.

Note if we increase 2 to then we only get a slight decrease in max load to with high probability.

Hashing

Unacceptable passwords

We want to check a user doesn’t put in a password that is considered unacceptable. We have a huge universe of possible passwords . We want to maintain of unacceptable passwords.

For a given password we want to be able to answer is ?

In Chain Hashing, we use a hash table (, ) of a given size . Where the entries in the associative array are linked lists that handle collisions.

For notation let and have then ideally we will have .

In Chain Hashing the length of the linked lists behave similarly to the balls in the bins problem. Therefore, if with high probability look up time will be which will be too high if is sufficiently large. This is where the best of 2 approach can help us.

Best of 2 hashing

Instead of using a hash table with only one hash function instead lets use two independent hash functions and . Then we do the following for operations:

  • To add we:
    • Compute and , and
    • add to the least loaded of and .
  • To check if is in
    • Compute and , and
    • check for in and .

From the analysis we did above if then this should have run time of .

Bloom filters

Bloom filters are another way to solve the unacceptable passwords problem.

  • Checks take time,
  • It uses less space than Chain Hashing, and
  • it is a simpler algorithm. However, there is a small probability of false positives but we say it is!