Gnutella Forums  

General Gnutella Development Discussion

#41 - January 30th, 2002
maksik (Mutella Developer, Germany)

Now I feel like I have to make a comment. I have not been to the GDF for a while, but when I was, I was really unimpressed by the amount of mess over there. It was nearly impossible to find what you were looking for, or to understand what those people had finally agreed on. I have no time to participate in the forums; I develop Mutella part-time, and that is ALL the time I can actually devote to the topic. It's a shame there is no place where I can go and check the latest updates to the protocol, etc. Clip2 made a major contribution by releasing the Gnutella Protocol Specification v0.4.x, and it's a shame nobody has repeated that for v0.6 and later.

Btw, the first functional version of Mutella was developed in about a month. Granted, I did not do it from scratch, which I now regret :-)

Maksik

#42 - January 30th, 2002
veniamin (Devotee)

...since you both know programming languages, why don't you help out on an existing open source project....

....and have to argue all the time with other developers about protocol extensions for Gnutella? Nope... thanx....

#43 - January 31st, 2002
gnutellafan (Gnutella Veteran) - New direction

Well, to the programmers here who are not working on a current client: let me suggest that you get involved and take Gnutella in a new direction. Use the ideas, optimize the protocol and the code, and add some great new features.

I think the most important thing Gnutella is lacking is security, in the form of anonymity. The best way to add it, as I have argued here, at the GDF, etc., is to encrypt file transfers and to cache a percentage of the transferred files, with the cached files themselves stored encrypted. Users would be required to provide at least 100 MB of HD space for encrypted files (or partial files), and could choose to share as much of their own files as they want. They would not be able to decrypt the cached files and therefore would not know what they hold. The network should not be able to tell whether the files being transferred are encrypted cache files or regular shared files, so no one could say who is sharing what.

This also brings a huge benefit: every user now shares something, even if it is only the 100 MB of encrypted files. And with encryption in place, I would of course have no problem with the program requiring that the download and partial folders be shared, providing even more resources to the net.
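
To make the idea concrete, here is a rough Python sketch of what such an opaque cache might look like. Everything in it is my own assumption rather than an agreed design: OpaqueCache and its methods are invented names, and the Fernet cipher from the third-party cryptography package merely stands in for whatever encryption the network would actually adopt.

[code]
# Sketch only: the caching node stores ciphertext it cannot read, because
# the symmetric key never leaves the publisher.
import hashlib
import os

from cryptography.fernet import Fernet  # pip install cryptography

CACHE_QUOTA = 100 * 1024 * 1024  # the 100 MB minimum proposed above


class OpaqueCache:
    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def used(self):
        return sum(os.path.getsize(os.path.join(self.directory, f))
                   for f in os.listdir(self.directory))

    def store(self, ciphertext):
        """Store an already-encrypted block; return its content address."""
        if self.used() + len(ciphertext) > CACHE_QUOTA:
            raise IOError("cache quota exceeded")
        key = hashlib.sha1(ciphertext).hexdigest()
        with open(os.path.join(self.directory, key), "wb") as fh:
            fh.write(ciphertext)
        return key

    def fetch(self, key):
        with open(os.path.join(self.directory, key), "rb") as fh:
            return fh.read()


# The publisher encrypts before pushing a block into strangers' caches:
secret = Fernet.generate_key()          # never leaves the publisher
ciphertext = Fernet(secret).encrypt(b"some shared file contents")

cache = OpaqueCache("/tmp/gnutella-cache")
addr = cache.store(ciphertext)
# The cache node can serve addr but has no idea what it holds; only
# someone given the key can recover the plaintext:
assert Fernet(secret).decrypt(cache.fetch(addr)) == b"some shared file contents"
[/code]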

It turns out that we have some very talented people here, and it would be wonderful to see them greatly advance gnutella to gnutella2.

I guess I am the only one here who doesn't know anything.

#44 - January 31st, 2002
Unregistered (Guest) - OT

Gnutellafan, how about starting new threads and explaining your ideas in more detail there?

#45 - April 5th, 2002
Pferdo (Devotee, Germany) - @ maksik

Is this what you're looking for?
http://rfc-gnutella.sourceforge.net/

#46 - April 5th, 2002
Nosferatu (Daemon, Romania) - Back to the TOPIC

SHA1 is the "agreed" method and is already implemented in some clients (BearShare .. any others?)

The limit for conflicting files is not determined by the horizon, i.e. 10,000-odd PCs, because you do not choose which 10,000 PCs you connect to, and the files are not predetermined. This is covered in any statistics course, for anyone who would like to argue.

The real limit is the number of possible files in the world, which is effectively unbounded: files are created all the time, they can be any size, and storage keeps growing.

But to be reasonable, for now, say no one is going to use Gnutella to download a file bigger than 1G. Then the number is the number of possible permutations of binary digits in a 1G file .. which is an astonishingly large number.

It's about 256^1,000,000,000 if I am not getting too confused (I was, but I edited this message to try to fix it; I think I have the number approximately correct this time). It's actually more, since I didn't count the files smaller than 1G, but ... well, the number is too big for Debian's arbitrary precision calculator to display, so let's just leave it at that.
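
For anyone who wants to sanity-check that figure, the arithmetic is easy to reproduce. A quick Python sketch (taking 1G as 10^9 bytes, as above):

[code]
import math

# A 1G file of 10**9 bytes has 8 * 10**9 bits, so there are
# 2**(8 * 10**9) = 256**(10**9) distinct possible files of exactly that
# size.  The number itself is unprintable, but its decimal length is not:
digits = int(10**9 * math.log10(256)) + 1
print(digits)  # roughly 2.4 billion decimal digits
[/code]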

Most people in the systems administration field have been happily using MD5 for years, but files are growing, so maybe MD5 is no longer considered enough.

I would like to see sources for why not.

The file hash should not be sent back with queries, since in most cases it is unnecessary. The file size is plenty for a rough estimate of duplicate files among the query results received.

The second step comes when the user decides to download a file: then you request the file hash from the client serving it.

Then you send out that file hash to find duplicate sources where the name differs greatly - believe me, there are plenty.

Once the file is retrieved completely, its hash is taken, and if it doesn't match what you were 'quoted', you start again (maybe using a different source! ;-)
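
Put as a sketch (Python; request_hash, find_sources_by_hash and download are hypothetical stand-ins for a client's real wire plumbing, not anything specified):

[code]
import hashlib

def fetch_verified(result, request_hash, find_sources_by_hash, download):
    """The flow described above: hash on demand, swarm by hash, verify after."""
    expected = request_hash(result)           # ask the serving client
    sources = find_sources_by_hash(expected)  # same hash, any filename
    for source in sources or [result]:
        data = download(source)
        if hashlib.sha1(data).hexdigest() == expected:
            return data                       # matches what we were 'quoted'
        # mismatch: start again, preferably from a different source
    raise IOError("no source delivered the promised hash")
[/code]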

OK? Comprende?

For small files you can use smaller hashes to determine duplicates, since the number of permutations of bits in a 10k file is (comparatively!) very small.

Perhaps when sending out the hash-based query for alternative sources, you send out the file size plus the hash.
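
Something like this, say (a sketch; the 10k cutoff and the choice of MD5 versus SHA1 are arbitrary assumptions of mine):

[code]
import hashlib

def dedup_key(data):
    """File size plus hash, with a cheaper digest for small files."""
    algo = hashlib.md5 if len(data) <= 10 * 1024 else hashlib.sha1
    return (len(data), algo(data).hexdigest())
[/code]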

Here is another use for hashes:

Hashes would be great for eliminating downloads of files you already have.

The way to implement it, though, is not to filter the query results when they are received, since that would require hashes for every single search result. Instead, the hash should be retrieved when the user clicks on the file (or when the automatic-download algorithm decides a file matches its criteria); if the file already exists on the user's PC, the Gnutella client just says 'skipping download - you already have the file in location x'.
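
That check is a cheap lookup once the client keeps an index of the hashes of its local files. A sketch (request_hash and download are the same hypothetical stand-ins as above):

[code]
import hashlib
import os

def build_local_index(shared_dirs):
    """Map SHA1 hex digest -> path for every file the user already has."""
    index = {}
    for d in shared_dirs:
        for name in os.listdir(d):
            path = os.path.join(d, name)
            if os.path.isfile(path):
                with open(path, "rb") as fh:
                    index[hashlib.sha1(fh.read()).hexdigest()] = path
    return index

def maybe_download(result, index, request_hash, download):
    digest = request_hash(result)  # fetched only when the user clicks
    if digest in index:
        print("skipping download - you already have the file in " + index[digest])
        return index[digest]
    return download(result)
[/code]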

Nos

[Edited 5 Apr 2002 to fix guess at number of permutations in a 1G file]

#47 - April 12th, 2002
Unregistered (Guest) - Re: Back to the TOPIC

Quote (originally posted by Nosferatu):
But to be reasonable, for now, say no one is going to use Gnutella to download a file bigger than 1G. Then the number is the number of possible permutations of binary digits in a 1G file .. which is an astonishingly large number. It's about 256^1,000,000,000 if I am not getting too confused.

There's not even that number of atoms in the whole Universe! Which I think is about 10^90, more or less.

#48 - April 12th, 2002
Smilin' Joe Fission (Gnutella Veteran, Canada) - Re: Re: Back to the TOPIC

Quote (originally posted by Unregistered):
There's not even that number of atoms in the whole Universe! Which I think is about 10^90, more or less.

However, if you do the math, a 1GB file has 8,589,934,592 bits. The number Nosferatu came up with is the total of all permutations of that 1GB file in which one or more of those bits has changed. When you change even one bit, the result is a completely new file, because its hash value will be different. I believe the number Nosferatu came up with may be pretty close.

As for the number of atoms in the universe.... I don't think that number is even close. Whatever scientist came up with that number is on drugs.

#49 - April 12th, 2002
Taliban (Gnutella Aficionado, Aachen)

The number of atoms in the universe is about 10^78. You can estimate this number by counting galaxies, measuring how bright they are, and estimating how big their mass is.

You don't need any drugs for that.

#50 - April 13th, 2002
Nosferatu (Daemon, Romania) - Re: Re: Re: Back to the TOPIC

I just had a conversation on IRC .. someone had a good idea; maybe some of you have heard it before.

Anyway, the idea is this: hash the first meg of the file as well as the whole file.

So that way you can tell that 'buffy vampire.divx' (20M) is the same file as 'buffy vampyre.divx' (80M), and get at least the first 20M.

Then you repeat the search later for files whose first-meg hash = x.

To implement this most reliably and sensibly, instead of the HUGE proposal's technique of always and only hashing the whole file, the best implementation would be a query: 'please hash the file from range x-y'.

This shouldn't be totally automated .. because someone might have a text file which includes a smaller text file that should be considered complete .. e.g. they may have tacked some personal notes onto the end of some classic document. You probably don't want the extended version, so a user control is needed: 'find bigger files which start off the same', or not.

In fact a really good implementation (not necessary for every client to implement for the scheme to work, as long as clients support the 'hash this part of the file please' extension) would be the one suggested below:

<Justin_> bigger or smaller
<Justin_> or have a slider hehe
<Justin_> the way md5sum works having 100 sums is not that intensive to make right?
<Justin_> cause its incremental no?
<Justin_> so if you had a slider in the program, that starts at 100%, that you can lower by 10% incremnts to find more files
<Justin_> as in, the default is files that match 100%, or files that match at the 90% mark, well not % it would have to be 10M intervals

Having the ability to request hashes for arbitrary portions of files would additionally make their use for verifying contents reliable: if someone could generate two files with the same hash (or when this happens randomly), simply checking the hash of a given subportion would detect the difference.
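
Justin is right that the sums are cheap: digests are incremental, and in Python's hashlib you can snapshot the running state with copy(), so one pass over the file yields a checkpoint every 10M plus the whole-file hash almost for free. A sketch of both that and the range request (the interval and the function names are my assumptions):

[code]
import hashlib

CHUNK = 64 * 1024                # read size; divides INTERVAL evenly
INTERVAL = 10 * 1024 * 1024      # the 10M steps from the slider idea

def checkpoint_hashes(path):
    """One pass over the file, recording (bytes_hashed, md5) every INTERVAL.

    copy() snapshots the running digest, so a hundred checkpoints cost
    barely more than hashing the file once.
    """
    h = hashlib.md5()
    done = 0
    marks = []
    with open(path, "rb") as fh:
        while chunk := fh.read(CHUNK):
            h.update(chunk)
            done += len(chunk)
            if done % INTERVAL == 0:
                marks.append((done, h.copy().hexdigest()))
    marks.append((done, h.hexdigest()))  # whole-file hash last
    return marks

def range_hash(path, start, end):
    """Answer a 'please hash the file from range x-y' request."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        fh.seek(start)
        remaining = end - start
        while remaining > 0:
            chunk = fh.read(min(CHUNK, remaining))
            if not chunk:
                break
            h.update(chunk)
            remaining -= len(chunk)
    return h.hexdigest()
[/code]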

Nos

----------

Quote (originally posted by Smilin' Joe Fission):
However, if you do the math, a 1GB file has 8,589,934,592 bits. The number Nosferatu came up with is the total of all permutations of that 1GB file in which one or more of those bits has changed. When you change even one bit, the result is a completely new file, because its hash value will be different.
Well, this is the question: is the HASH indeed large enough to *have* a unique value for each individual permutation of a 1G file, and if not, does it really matter?

Certainly we are not going to generate every possible version of a 1G file .. ever (well, unless some pr!ck sits down in the far future and does it on purpose as a programming exercise using some newfangled superdupercomputer we can't even imagine yet .. but I stray from the topic). We do need a hash with enough values that *most probably* each individual file we generate will have a unique value .. but it can't be known for sure unless you actually generate the hash for each file (i.e. generate each file).

Hashes are funny things. (I'm still searching for a good reference to back that statement up .. but I don't have time to find one right now .. see a later posting.)

I think if you look at the file size and the hash together, you have enough certainty to call it a definite match when searching for alternate download sources. A better technique is described in the first portion of this post.

Quote (originally posted by Smilin' Joe Fission):
I believe the number Nosferatu came up with may be pretty close.

As for the number of atoms in the universe.... I don't think that number is even close. Whatever scientist came up with that number is on drugs.
I did a quick one on my calculator based on the figure for the 'mass of the observable universe' in O'Hanian's 'Physics' textbook .. and 1e70 would seem to be what "they" (the scientists) think. But I agree about the drugs.

This will do as a reference: http://groups.google.com/groups?q=number+atoms+universe&hl=en&scoring=r&selm=4kc1fu%24gej%40agate.berkeley.edu&rnum=1 - at least the guy has the word 'physics' in his email address, as well as the word 'berkeley'. I couldn't be bothered checking any more thoroughly than that.

Nos

[Edited 14 Apr 2002 to add URL reference for atom count left out of the initial post]