Solution

BearShare's solution is to create two tables.
One starts out empty (call this full) and the other starts out big enough for 1,000 GUIDs (call this recent).
Each time you see a GUID, check full, and then check recent, in that order, to see if it already exists.
Each table can use a binary search tree for efficient lookups.
If it already exists, do nothing.
If it doesn't exist, insert the new GUID into recent; the binary traversal you just performed shows you exactly where to insert it.
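The lookup-then-insert path can be sketched as follows. This is a minimal Python sketch, not BearShare's code: sorted lists searched with the standard `bisect` module stand in for the binary trees, and the name `seen_guid` is my own. Note how the insertion point found during the duplicate check is reused for the insert.

```python
import bisect

# Two sorted lists stand in for the two binary search trees.
full = []      # starts out empty
recent = []    # sized for 1,000 GUIDs initially; capacity managed elsewhere

def seen_guid(guid):
    """Return True if guid is a duplicate; otherwise insert it into recent."""
    # Check full first, then recent, using binary search.
    i = bisect.bisect_left(full, guid)
    if i < len(full) and full[i] == guid:
        return True
    j = bisect.bisect_left(recent, guid)
    if j < len(recent) and recent[j] == guid:
        return True
    # The search already found the insertion point, so reuse it.
    recent.insert(j, guid)
    return False
```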
When recent is full, perform the following algorithm:
- Subtract the time at which the first entry was stored in recent from the current time. This is the time, in seconds, it took to fill recent.
- Calculate the number of entries required to store 10 minutes' worth of GUIDs, based on the number of GUIDs stored in recent (1,000 initially). The formula is: needed = (count * 10 * 60) / seconds, where count is the number of entries in recent, seconds is the time it took to fill recent, and needed is the number of entries required to hold 10 minutes' worth of data.
- Discard the contents of full and reallocate full to needed entries.
- Swap full and recent.
At this point, full really is full, and recent is empty but has room for needed entries.
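The rotation steps above can be sketched as a single function. This is an illustrative sketch, not BearShare's implementation: the `state` dictionary and its field names (`first_insert_time`, `capacity`) are hypothetical bookkeeping, and `now` is a parameter so the arithmetic is explicit.

```python
import time

def rotate(state, now=None):
    """Resize and swap the two tables once recent reaches capacity.

    state holds: full, recent (sorted lists), first_insert_time (when
    recent got its first entry), and capacity. Field names are
    illustrative, not BearShare's.
    """
    if now is None:
        now = time.time()
    # Time, in seconds, it took to fill recent.
    seconds = now - state["first_insert_time"]
    count = len(state["recent"])
    # Entries needed to hold 10 minutes (600 seconds) of GUIDs at the
    # observed arrival rate: needed = (count * 10 * 60) / seconds.
    needed = max(1, int(count * 10 * 60 / max(seconds, 1e-9)))
    # Discard full's contents, then swap: the old recent becomes the
    # new full, and a fresh, empty table sized for `needed` entries
    # becomes the new recent.
    state["full"] = state["recent"]
    state["recent"] = []
    state["capacity"] = needed
    state["first_insert_time"] = None
```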
This algorithm requires only two variably sized allocations. Using a binary tree, you get the fastest possible find. As an added bonus, insertions execute in constant time because you can re-use the information gleaned during the find (which you had to perform anyway to check for duplicates) to determine the insertion point.
Using a slight variation, the same table structure can be used to store PUSH routes. The main difference is that entries are always added to recent, even if they already exist in full. It is also wise to keep these push-routing tables on a per-host basis and, when routing pushes, binary-search each host's pair of tables.
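The PUSH-route variation might look like this. Again a hedged sketch under my own assumptions: `push_routes`, `record_push_route`, and `has_push_route` are illustrative names, hosts are plain dictionary keys, and sorted lists with `bisect` stand in for the per-host table pairs.

```python
import bisect

# One (full, recent) pair per host; the keying scheme is illustrative.
push_routes = {}  # host -> {"full": sorted list, "recent": sorted list}

def record_push_route(host, guid):
    """Always insert into recent, even if the GUID already exists in
    full -- the key difference from the duplicate-detection tables."""
    tables = push_routes.setdefault(host, {"full": [], "recent": []})
    j = bisect.bisect_left(tables["recent"], guid)
    if j < len(tables["recent"]) and tables["recent"][j] == guid:
        return  # already in recent; nothing to do
    tables["recent"].insert(j, guid)

def has_push_route(host, guid):
    """Binary-search both of the host's tables when routing a push."""
    tables = push_routes.get(host)
    if tables is None:
        return False
    for t in (tables["full"], tables["recent"]):
        i = bisect.bisect_left(t, guid)
        if i < len(t) and t[i] == guid:
            return True
    return False
```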
This solution automatically adapts to changing network conditions. It guarantees at least 10 minutes worth of storage, and usually provides more (since there are two tables).
The memory demands are modest, and there is, on average, only a single allocation performed every 10 minutes.
I rather doubt a better algorithm exists.