September 3rd, 2006
macho6868

The purpose of this one-way hash filter is to achieve large-scale hashing within a small amount of data.

It is similar to a Bloom filter, except that instead of a binary system the filter uses an a-z system: each point can hold more than one value and is not limited to binary.


View source code v0.1 (bitwise shifting to be added soon).
View an example of a spell checker done entirely in PHP with a 300,000-word dictionary (2.5 MB). Its hash table is 50x50 (a 1-in-3,250,000 probability) and at that size is 419 KB serialized and uncompressed. A 25x25 hash table would be around 200 KB, but the probability of false positives is much, much higher (compare the 25x25 matrix example against the one above).


--------------------------------------------------------------------------------

Why use these?

A 250x250 matrix (large) has a unique factor of over 40 million and is around 16 MB blank. The structure is designed for large hashes: it would easily be possible to hold the entire domain name system hashed in around 20 MB, which means very fast lookups on whether a name is in the system.
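As a rough capacity check, the 1-in-3,250,000 figure quoted for the 50x50 spell-checker table is consistent with width x height x depth x 26 letters, if each small filter is assumed to be 50 slots deep (the depth value is my assumption, not stated in the post):

```python
# Unique (x, y, depth, letter) combinations in a 50x50 table.
# The 50-slot depth is an assumption chosen to match the stated figure.
width, height, depth, letters = 50, 50, 50, 26
unique_factor = width * height * depth * letters
print(unique_factor)  # 3250000 possible marker combinations
```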


--------------------------------------------------------------------------------

How does this filter work?


Take an item and hash it four times, reducing each hash value by modulus to one of the following:
A marker for the X axis of the matrix.
A marker for the Y axis of the matrix.
A marker for how deep into the small filter at that matrix point to go.
A marker for the letter to use.
Insert the item twice, flipping the X and Y values for the second insertion.

Each item hashed into the database gets a unique set of values for the markers above. For a lookup, the letter must be found at both exact places within the entire filter. In the spell-checker example above there is a 1-in-1,625,000 probability of a collision at any given point (half the 1-in-3,250,000 figure, because each item is inserted twice).

These filters can take a while to generate; they are very processor-intensive to create. They use a matrix format for fast lookups, and each matrix point holds a small filter.
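A minimal sketch of the scheme above in Python (the original project is PHP; the class name, the salted-MD5 hashing, and the use of sets instead of packed strings are my assumptions, not the original implementation):

```python
import hashlib

class MatrixFilter:
    """Sketch of a matrix of small filters; names here are mine."""

    def __init__(self, size=50, depth=50):
        self.size, self.depth = size, depth
        # Each matrix point holds a small filter: `depth` slots, each a
        # set of a-z letters (the original packs them into a string
        # using upper-case letters as continuation markers).
        self.grid = [[[set() for _ in range(depth)]
                      for _ in range(size)] for _ in range(size)]

    def _markers(self, item):
        # Hash the item four times (salted MD5 is an assumption) and
        # reduce each hash by modulus to one marker.
        h = [int(hashlib.md5((salt + item).encode()).hexdigest(), 16)
             for salt in ("x:", "y:", "d:", "l:")]
        x = h[0] % self.size                  # X axis of the matrix
        y = h[1] % self.size                  # Y axis of the matrix
        d = h[2] % self.depth                 # depth into the small filter
        letter = chr(ord('a') + h[3] % 26)    # which a-z letter to store
        return x, y, d, letter

    def add(self, item):
        x, y, d, letter = self._markers(item)
        self.grid[x][y][d].add(letter)   # first placement
        self.grid[y][x][d].add(letter)   # second, with X and Y flipped

    def __contains__(self, item):
        x, y, d, letter = self._markers(item)
        # The letter must be present at BOTH mirrored points.
        return (letter in self.grid[x][y][d]
                and letter in self.grid[y][x][d])
```

Because each item lands at both (x, y) and (y, x), a lookup only reports membership when the letter appears at both mirrored points, which is what keeps the false-positive rate low.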


--------------------------------------------------------------------------------

How is it that this filter can hold so much information?

Example small filter

0000000000

Insert "d" at position 2 (note: counting starts at zero).

00d0000000

Insert "z" at position 5.

00d00z0000

Now it becomes interesting: insert "f" at position 2, on top of the letter already in place.

00Df00z0000

If another letter needs to go at the same point, insert "g" at position 2, on top of the two letters already in place.

00DFg00z0000

The array decoder maintains all positions: when it encounters an upper-case letter, it keeps concatenating until a non-upper-case letter is found, which ends that slot.

Array
(
[0] => 0
[1] => 0
[2] => DFg
[3] => 0
[4] => 0
[5] => z
[6] => 0
[7] => 0
[8] => 0
[9] => 0
)
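The packing and unpacking walked through above can be sketched in Python (the original is PHP; the function names are mine):

```python
def decode(filt):
    """Split the packed string into per-position slots. An upper-case
    letter means 'more letters follow in this slot', so a slot runs
    until the first non-upper-case character."""
    slots, run = [], ''
    for ch in filt:
        run += ch
        if not ch.isupper():   # a non-upper-case char ends the slot
            slots.append(run)
            run = ''
    return slots

def insert(filt, pos, letter):
    """Place a lower-case letter at pos. Letters already stored there
    are upper-cased so the decoder knows the slot continues."""
    slots = decode(filt)
    slots[pos] = letter if slots[pos] == '0' else slots[pos].upper() + letter
    return ''.join(slots)
```

Replaying the post's example: starting from "0000000000", inserting "d" at 2, "z" at 5, "f" at 2, and "g" at 2 yields "00DFg00z0000", which decodes back to the array shown above.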

Multiple hashing algorithms are used for placements.

TO DO:


A non-XY-flip version that uses a unique secondary value or a combination of other values (though I feel the number of hashes is already approaching a practical maximum).
More documentation and a list of cited papers.
A list of links to projects/experiments using Bloom filters for large datasets.