Gnutella Forums  

#1, January 5th, 2002, Unregistered (Guest)
Proposal for development of Gnutella

When a servent processes a query and returns the filename and filesize of a match, it should also return a hash code.

This would allow downloading from multiple sources even when users have renamed the original file.

A good hash (20 bytes) combined with a filesize check should avoid "false duplicates".
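
As a rough sketch of the idea (hypothetical code, not from any actual client), a servent could key each shared file by its SHA-1 digest plus its filesize:

    import hashlib
    import os

    def file_identity(path):
        """Return (sha1_hex, filesize), identifying content independent of filename."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest(), os.path.getsize(path)

    # Two copies of the same file under different names compare equal:
    # file_identity("original.mp3") == file_identity("renamed.mp3")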

Marc.
marc@szs.ca
#2, January 5th, 2002, Moak

Yep, I suggest/vote for it too.

There is already a well-documented proposal named 'HUGE' [1]. From its "Motivation & Goals":

o Folding together the display of query results which represent the exact same file -- even if those identical files have different filenames.

o Parallel downloading from multiple sources ("swarming") with final assurance that the complete file assembled matches the remote source files.

o Safe "resume from alternate location" functionality, again with final assurance of file integrity.

o Cross-indexing GnutellaNet content against external catalogs (e.g. Bitzi) or foreign P2P systems (e.g. FastTrack, OpenCola, MojoNation, Freenet, etc.)

[1] "HUGE" - http://groups.yahoo.com/group/the_gd...roposals/HUGE/ (Yahoo account required)
#3, January 6th, 2002, Unregistered (Guest)
Could be simple.

The HUGE thing looks complicated for no reason. The risk of a false match on duplicate files is close to impossible (on the order of 1 in billions of billions) with a 160-bit hash plus a filesize check. You simply do the download and check that the received file matches the hash...

And it's quite simple to code a component that can download from multiple sources (swarming). You simply test the servers for resume capability, split the file to download into blocks, and create threads that request these blocks from different servers; you can even create multiple threads (connections) per server.
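
A bare-bones sketch of that block splitting and assignment (block size and the round-robin policy are illustrative assumptions, not from the post; a real client would fetch each block with an HTTP Range request and retry failures on another source):

    def split_into_blocks(filesize, block_size=256 * 1024):
        """Split a download of `filesize` bytes into (offset, length) blocks."""
        return [(off, min(block_size, filesize - off))
                for off in range(0, filesize, block_size)]

    def assign_blocks(blocks, sources):
        """Deal blocks out to resume-capable sources, round-robin."""
        return [(sources[i % len(sources)], blk)
                for i, blk in enumerate(blocks)]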

In order to improve download/connection speed, each client should keep a list of other clients that have the same file and reply not only with its own IP but with the IPs of others that can provide the same file. This could be done if hubs (supernodes) are inserted into the network. They could scan for duplicate files!
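
A toy sketch of such a hub-side index (purely illustrative; nothing in the thread specifies this structure):

    class AltSourceIndex:
        """Map a file's (sha1, size) identity to the peers known to share it."""
        def __init__(self):
            self.sources = {}

        def add(self, sha1, size, peer):
            self.sources.setdefault((sha1, size), set()).add(peer)

        def lookup(self, sha1, size):
            # Answer with every known provider, not just the replying host.
            return sorted(self.sources.get((sha1, size), ()))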

I have already programmed a swarming component in Delphi and it's working well. I will now work on adding on-the-fly add/remove of download sources.

If anyone wants to work on it, let me know and I will send you the sources. It uses Indy for TCP access.

Marc.
marc@szs.ca
Reply With Quote
  #4 (permalink)  
Old January 6th, 2002
Moak's Avatar
Guest
 
Join Date: September 7th, 2001
Location: Europe
Posts: 816
Moak is flying high
Default

I thought HUGE is simple and flexible?

It does explain a lot of the basics and also details how to include hashes in binary Gnutella messages (did you notice that you have to encode 0x00, and that compatibility with other/older clients is guaranteed?). If you think it can be done more easily, write a paper... I prefer easy solutions. :-)

PS: About what you said on swarming and a list of alternative downloads: yes, that is another advantage once we have hashes and query-hit caches. I'm a big fan of superpeers, hashes, swarming and multisegmented downloading. :-)
#5, January 6th, 2002, veniamin

I am not sure, but I think a CRC could do the job. For each file in a query hit we can put its CRC between the two nulls, like Gnotella does for MP3 files.
#6, January 6th, 2002, Moak

You can do that with HUGE. It also describes the encoding between the two nulls, plus the new GET request. I think it prefers SHA1 for the hash, but which one you use is flexible: CRC, MD5...

The question I have: which is the best algorithm? Can someone give a summary/overview? Hmm, it should be unique enough within a typical horizon (high security is not the topic), small in size (to keep broadcast traffic low), and fast to calculate.
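
Digest sizes are fixed by the algorithms (CRC32: 4 bytes, MD5: 16 bytes, SHA1: 20 bytes); relative speed is easy to measure. A throwaway timing sketch (illustrative only, not a benchmark from this thread):

    import hashlib
    import time
    import zlib

    data = b"x" * (16 * 1024 * 1024)  # 16 MB of dummy input

    for name, fn in [("crc32 (4 bytes)", lambda d: zlib.crc32(d)),
                     ("md5  (16 bytes)", lambda d: hashlib.md5(d).digest()),
                     ("sha1 (20 bytes)", lambda d: hashlib.sha1(d).digest())]:
        start = time.perf_counter()
        fn(data)
        print(name, round(time.perf_counter() - start, 4), "s")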

#7, January 6th, 2002, Unregistered (Guest)
Follow up

About HUGE: when I said HUGE looks complicated, I meant that from what you tell me, it's more about verification of data integrity.

I prefer less verification and better speed (a smaller protocol), as long as the verification is good enough.

About CRC: it's really not a good idea to use CRC16 or CRC32. The latter gives only 4 billion (2^32) values, and that's not enough; you could get false duplicates. SHA1 uses 20 bytes (160 bits), which gives vastly more possibilities: 2^160 is 4 billion multiplied by itself five times. You get the point; with that many possibilities you reduce the chance of false duplicates to practically nothing.

SHA1 is fast enough (over 1 MB/sec), but speed is not that important: a client program could generate all the SHA1 hashes at startup and cache them in memory. 1000 files would require only 20 KB of memory. Hashing on every query would not be a good idea...
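
A minimal sketch of that startup cache (the layout is invented for illustration; the 20 raw bytes per digest are what make the 20 KB figure work out):

    import hashlib
    import os

    def build_hash_cache(shared_dir):
        """Hash every shared file once at startup; ~20 bytes per entry."""
        cache = {}
        for name in os.listdir(shared_dir):
            path = os.path.join(shared_dir, name)
            if os.path.isfile(path):
                with open(path, "rb") as f:
                    cache[path] = hashlib.sha1(f.read()).digest()  # 20 bytes
        return cache

    # Query handling then looks digests up in `cache` instead of rehashing.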

Marc
#8, January 6th, 2002, Moak
hmm

What do you mean by 'verification'?
The HUGE goals describe exactly what we really need today: a) efficient multisegmented downloading/caching (grouping identical files together from query hits for parallel downloads or query caches), and b) efficient automatic requerying (finding alternative download locations).

I agree, the protocol should be as small as possible.
While you agree with SHA1 (I still have no clue about the advantages/disadvantages of CRCxx, SHA1, MD5, TigerTree etc.), what could be done better than the HUGE paper describes? I think HUGE is pretty simple. It describes hash positioning in Queries/Queryhits and the necessary HTTP headers. Then it encodes the hash to make it fit into Gnutella traffic (nulls inside the hash must be encoded!) and also into HTTP traffic. For example, the well-known 'GnutellaProtocol04.pdf' becomes 'urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB'.
Perhaps you don't agree with Base32 encoding of the hash; what could be done better?
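
For reference, producing such a urn:sha1 string takes only a few lines (a sketch; the HUGE draft itself defines the exact format):

    import base64
    import hashlib

    def sha1_urn(path):
        """Encode a file's SHA-1 digest as a HUGE-style urn:sha1 string.
        20 digest bytes -> exactly 32 Base32 characters, no padding."""
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).digest()
        return "urn:sha1:" + base64.b32encode(digest).decode("ascii")

    # e.g. sha1_urn("GnutellaProtocol04.pdf")
    # -> "urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB" (the example above)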

CU, Moak
#9, January 7th, 2002, Unregistered (Guest)
Follow up

I know nothing about HUGE; I simply got this from your previous post:

> Safe "resume from alternate location" functionality, again with final assurance of file integrity.

For me "final assurance" mean once download is complete you must do some kind of block check on the original source, with multiple CRC to verify that all the block receive match the original file.

That is what I call "final assurance". Like I said, I don't know HUGE; what I'm proposing is not final assurance: just compute a SHA1, download the file from all matching sources, and perform no check at the end of the transfer. If HUGE does that too, then it can't claim "final assurance of file integrity", but it's exactly what I want to do.

To have "final assurance" would use to much bandwidth, performance would be better with a small risk for corrupted file, if it's in the range or 1/10000000000000000 sound ok to me.

I will try to take the time to look into the HUGE proposal.

CRC vs SHA1: for deriving an essentially random number from a piece of data, CRC works much like SHA1. But SHA1 adds security: it was built so that it is practically impossible to reconstruct the original data from the hash (good for password storage). And of course it generates a larger number, since its key is 20 bytes vs 4 bytes for CRC32.

Marc.
#10, January 7th, 2002, Tamama
Base32?

The only thing I find somewhat weird about HUGE is that the SHA1 is Base32 encoded. This means only 5 bits of every 8-bit byte are used. Just doesn't make sense... oh well.

The GET request is somewhat strange as well... a simple:

GET urn:sha1:452626526(more of this ****)SDGERT GNUTELLA/0.4

would work just as well...
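
For comparison, the HUGE draft routes such requests through an RFC 2169-style "uri-res" resolver path instead of putting the URN straight into the request line; Base32 (rather than Base64) is presumably used so the URN survives case-insensitive, URL-restricted contexts. A sketch of that style of request (details approximate):

    GET /uri-res/N2R?urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB HTTP/1.1
    Host: 10.0.0.1:6346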

Some thoughts..