Previous: , Up: Lists   [Contents][Index]


11.13 Bloom Filters

The concept of a Bloom filter was introduces by Burton H. Bloom in 1970. It is a constant space complexity, probabilistic data structure that is used to do so-called membership decisions on anonymous sets. That is, you can easily decide whether a given element is a member of a certain set, whereas you cannot select elements from it (or even traverse through all elements). Moreover, like hash-tables the time complexity of Bloom filters for adding and removing elements, as well as membership-decision is in O(1) – it is in O(k) indeed where k is the degree of the Bloom filter.

Probabilistic, however, means that false positives are possible, but false negatives are not. The probability of false positives grows subexponentially with the number of added elements.

Another interesting property of Bloom filters of equal order and degree is the ability to perform set union and intersection operations within O(1) space and time complexity!

SXEmacs’ Bloom filters have their own lisp-type, but they do not have a special input syntax.

Function: make-bloom &optional order degree

Return an empty bloom-filter.

Optional argument order (a positive integer) specifies the “length” of the internal filter vector, defaults to 256.

Optional argument degree (a positive integer) specifies the number of hash slices, defaults to 8.

For reasons of convenience, we also provide a constructor for a complete Bloom filter. That is a bloom filter which owns any possible element.

Function: make-bloom-universe &optional order degree

Return a complete bloom-filter.

Optional argument order (a positive integer) specifies the “length” of the internal filter vector, defaults to 256.

Optional argument degree (a positive integer) specifies the number of hash slices, defaults to 8.

Function: bloomp object

Return non-nil if object is a bloom filter, nil otherwise.

There are two further auxiliary functions to determine the bloom filter parameters:

Function: bloom-order bloom

Return the order of the bloom-filter bloom.

Function: bloom-degree bloom

Return the degree of the bloom-filter bloom.

Adding/removing elements to/from a bloom filter always works by side-effect.

Function: bloom-add bloom element

Add element to the bloom-filter bloom.

Function: bloom-remove bloom element

Remove element from the bloom-filter bloom.

The membership decision is done with bloom-owns-p.

Function: bloom-owns-p bloom element

Return non-nil if element is in the bloom-filter bloom.

(setq bl (make-bloom))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 0>
(bloomp bl)
  ⇒ t

Now we want to add three integers and test some memberships.

(bloom-add bl 12)
  ⇒ 12
(bloom-add bl 21)
  ⇒ 21
(bloom-add bl 0)
  ⇒ 0
(bloom-owns-p bl 12)
  ⇒ t
(bloom-owns-p bl 21)
  ⇒ t
(bloom-owns-p bl 0)
  ⇒ t
(bloom-owns-p bl 13)
  ⇒ nil
(bloom-owns-p bl 17)
  ⇒ nil

Now let us remove some elements

(bloom-remove bl 12)
  ⇒ 12
(bloom-remove bl 0)
  ⇒ 12
(bloom-owns-p bl 12)
  ⇒ nil
(bloom-owns-p bl 21)
  ⇒ t
(bloom-owns-p bl 0)
  ⇒ nil

Of course, Bloom filters work with any lisp type (not just integers). Internally, they use an object’s hash value and scatter it over a fixed-length array.

(bloom-owns-p bl "horse")
  ⇒ nil
(bloom-add bl "horse")
  ⇒ "horse"
(bloom-owns-p bl "horse")
  ⇒ t
(bloom-owns-p bl "snake")
  ⇒ nil

As remarked above, union and intersection of Bloom-filters of equal order and degree is possible in constant time regardless of the size.

Function: bloom-union &rest blooms

Return the union Bloom filter of all arguments.

Function: bloom-intersection &rest blooms

Return the intersection Bloom filter of all arguments.

(setq bl1 (make-bloom))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 0>
(bloom-add bl1 2)
  ⇒ 2
(bloom-add bl1 4)
  ⇒ 4
(bloom-add bl1 'foobar)
  ⇒ foobar
(setq bl2 (make-bloom))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 0>
(bloom-add bl2 "horse")
  ⇒ "horse"
(bloom-add bl2 "snail")
  ⇒ "snail"
(bloom-add bl2 'foobar)
  ⇒ foobar
(setq blu (bloom-union bl1 bl2))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 6>
(bloom-owns-p blu 2)
  ⇒ t
(bloom-owns-p blu 4)
  ⇒ t
(bloom-owns-p blu "horse")
  ⇒ t
(bloom-owns-p blu "snail")
  ⇒ t
(bloom-owns-p blu 'foobar)
  ⇒ t
(setq bli (bloom-intersection bl1 bl2))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 0>
(bloom-owns-p bli 2)
  ⇒ nil
(bloom-owns-p bli 4)
  ⇒ nil
(bloom-owns-p bli "horse")
  ⇒ nil
(bloom-owns-p bli "snail")
  ⇒ nil
(bloom-owns-p bli 'foobar)
  ⇒ t

Now we want to illustrate the extreme performance gain over lists, and compare to the performance of hash-tables. Therefore, we use a setup routine which fills a data container with 10000 distinct elements. Then we do 50000 membership tests. Let’s start with ordinary lisp lists. For each of these both tests we measure the time.

(defun test-element ()
  "Return a random test element."
  (incf cnt))
  ⇒ test-element
(setq tlist (list))
  ⇒ nil
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 10000)
    (setq tlist (cons (test-element) tlist)))
  (- (current-btime) start))
  ⇒ 140733
;; 50000 membership tests on tlist … take a coffee break
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 50000)
    (member (test-element) tlist))
  (- (current-btime) start))
  ⇒ 154817323

Let’s consider hash-tables:

(setq thash (make-hash-table))
  ⇒ #<hash-table size 0/29 0x1619>
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 10000)
    (puthash (test-element) 'some-value thash))
  (- (current-btime) start))
  ⇒ 173210
;; 50000 membership tests on thash
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 50000)
    (gethash (test-element) thash))
  (- (current-btime) start))
  ⇒ 723030

Let’s finish our considerations with Bloom filters:

(setq tbfil (make-bloom))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 0>
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 10000)
    (bloom-add tbfil (test-element)))
  (- (current-btime) start))
  ⇒ 128565
;; 50000 membership tests on tbfil
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 50000)
    (bloom-owns-p tbfil (test-element)))
  (- (current-btime) start))
  ⇒ 757945

Evaluating all these results, we can state that hash-tables and Bloom-filters provide equal performance. For extremely large sets, Bloom-filters may profit from the absence of resizing the underlying array.

;; 400000 insertions into a bloom-filter
(setq tbfil (make-bloom))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 0>
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 400000)
    (bloom-add tbfil (test-element)))
  (- (current-btime) start))
  ⇒ 5464420
;; 400000 insertions into a hash-table
(setq thash (make-hash-table))
  ⇒ #<hash-table size 0/29 0x1066>
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 400000)
    (puthash (test-element) 'some-value thash))
  (- (current-btime) start))
  ⇒ 6392474

Furthermore, Bloom filters are able to track how often equal elements have been inserted, for hash-tables the value field may be used to implement something similar.

However, bloom filters tend to report more and more false positives the more elements you add. False negatives are never possible. That means with a sufficiently small bloom filter and a sufficiently large set of elements it is just a matter of probability when this filter will turn into a universe.

(setq b4 (make-bloom 256 64))
  ⇒ #<bloom-filter :order 256 :degree 64 :size 0>

;; now add 800000 numbers to it
(dotimes (i 800000)
  (bloom-add b4 i))
  ⇒ nil

;; now check this
(bloom-owns-p b4 'a-symbol-I-never-added)
  ⇒ t
(bloom-owns-p b4 "a string I never added")
  ⇒ t
(bloom-owns-p b4 [a vector I never added])
  ⇒ t

We see that b4 has turned into a quasi-universe. There might be objects which are not yet in b4, but the probability to find one is extremely small.

Given a bloom filter of degree d and size s, and given the bloom filter already contains n elements, the exact probability to get a false positive (an element which has not been added, but bloom-owns-p reports so) is:

So for our example the probability is veeery close to one, to be precise, it is:

0.9999...99993386493554255105254275560...
  ^^^^^^^^^^^
  87027 times

Previous: , Up: Lists   [Contents][Index]