April 5th, 2002
Nosferatu
Daemon

Join Date: March 25th, 2002
Location: Romania
Posts: 64

Back to the TOPIC

SHA1 is the "agreed" method and is already implemented in some clients (BearShare ... any others?)
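
For reference, computing such a hash is straightforward in most languages; here is a minimal Python sketch (the function name and chunk size are my own choices, not anything the clients standardise on):

[code]
import hashlib

def sha1_of_file(path, chunk_size=64 * 1024):
    """Return the SHA1 hex digest of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
[/code]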





The limit for conflicting files is not determined by the horizon, i.e. the 10,000-odd PCs you can see - because you do not choose which 10,000 PCs you connect to, and the files they hold are not predetermined. Any introductory statistics course covers this, for anyone who would like to argue.



The limit is the number of possible files in the world, which is effectively unbounded: new files are created all the time, they can be any size, and storage is growing all the time.



But to be reasonable, say for now that no one is going to use Gnutella to download a file bigger than 1 GB, so the limit is the number of possible permutations of binary digits in a 1 GB file ... which is an astonishingly large number.



It's about 256^1,000,000,000 if I am not getting too confused (I was, but I edited this message to try to fix it; I think the number is approximately correct this time). It's actually more, since I didn't count the files smaller than 1 GB, but ... well, the number is too big for Debian's arbitrary-precision calculator to display, so let's just leave it at that.
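
To get a feel for the size without trying to print it, you can count its decimal digits with a logarithm; a quick Python back-of-envelope sketch:

[code]
import math

# A 1 GB file is 10**9 bytes, each byte one of 256 values,
# so there are 256**(10**9) distinct possible files.
digits = 10**9 * math.log10(256)
print(f"256**(10**9) has about {digits:,.0f} decimal digits")
[/code]
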
Most people in the systems administration field have been happily using MD5 for years, but files are growing, so maybe MD5 is no longer considered enough.



I would like to see sources for why not.
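
For a rough sense of the numbers (my own back-of-envelope estimate, not a source): the usual birthday-bound approximation puts the chance of any two of n files colliding under a b-bit hash at roughly n^2 / 2^(b+1). A quick Python sketch comparing MD5 and SHA1 for a billion distinct files:

[code]
# Birthday-bound approximation, valid while the probability is small:
#   p ~ n**2 / 2**(b + 1)   for n files and a b-bit hash
n = 10**9  # a billion distinct files on the network

for name, bits in [("MD5", 128), ("SHA1", 160)]:
    p = n**2 / 2**(bits + 1)
    print(f"{name} ({bits} bits): chance of any accidental collision ~ {p:.1e}")
[/code]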





The file hash should not be sent back with queries, since in most cases it is unnecessary. The file size is plenty to make a rough estimate of duplicate files among the query results received.



The second step: when the user decides to download a file, you request the file hash from the client serving that file.



Then you send out that file hash to find duplicate sources where the name differs greatly - believe me, there are plenty.



Once the file is retrieved completely, the hash is taken, and if it doesn't match what you were 'quoted', you start again (maybe using a different source! ;-)
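
Put together, the client-side logic might look something like this sketch (request_hash, find_sources and download_from are hypothetical placeholders for whatever network layer a real client has, not any actual API):

[code]
import hashlib

def fetch_verified(query_hit, request_hash, find_sources, download_from):
    """Download a file and verify it against the hash the source quoted."""
    quoted = request_hash(query_hit)                 # ask the serving client for its hash
    sources = find_sources(quoted, query_hit.size)   # query the network by hash + size
    for source in sources:
        data = download_from(source)
        if hashlib.sha1(data).hexdigest() == quoted:  # verify the completed download
            return data
        # Mismatch: the source lied or the transfer was corrupted;
        # start again with the next source.
    raise IOError("no source delivered a file matching the quoted hash")
[/code]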



OK? Comprende?



For small files, you can use smaller hashes to determine duplicates, since the number of permutations of bits in a 10 KB file is very small (comparatively!).

Perhaps when sending out the hash-based query for alternative sources, you send out the file size plus hash.
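
A sketch of what such a query key might look like (the exact wire format would be up to the protocol; this tuple is just illustrative):

[code]
import hashlib

def source_query_key(data: bytes):
    """Build a (size, SHA1) pair for finding alternative sources."""
    return (len(data), hashlib.sha1(data).hexdigest())

# Two copies of the same bytes under different filenames give the same key.
print(source_query_key(b"the same song, different filename"))
[/code]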





Here is another use for hashes:



Hashes would be great for eliminating downloading of files you already have.



The way it should be implemented, though, is not to filter the query results when they are received, since this would require hashes for every single search result. Instead, the hash should be retrieved when the user clicks on the file (or when the automatic-download algorithm determines that a file matches its criteria). If a file with that hash already exists on the user's PC, the Gnutella client just says 'skipping download - you already have the file in location x'.
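
A minimal sketch of that check, assuming the client keeps an index of hashes for its shared files (the index structure is my own invention for illustration):

[code]
import hashlib
import os

def build_hash_index(shared_dir):
    """Map SHA1 hex digest -> local path for every shared file."""
    index = {}
    for name in os.listdir(shared_dir):
        path = os.path.join(shared_dir, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                index[hashlib.sha1(f.read()).hexdigest()] = path
    return index

def maybe_download(remote_hash, index):
    """Return True if a download should proceed, False if we already have the file."""
    if remote_hash in index:
        print(f"skipping download - you already have the file in {index[remote_hash]}")
        return False
    return True
[/code]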





Nos
[Edited 5 Apr 2002 to fix my guess at the number of permutations in a 1 GB file]
