I just had a conversation on IRC .. someone had a good idea, maybe some of you have heard it before.
Anyway, the idea is this: hash the first meg of the file as well as the whole file.
So that way you can tell that a 20M 'buffy vampire.divx' is the same file as an 80M 'buffy vampyre.divx', and get at least the first 20M.
Then you repeat the search later for files whose first-meg hash = x.
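As a rough sketch of that idea (MD5, the 1 MB prefix size and the function name are my own assumptions, not anything from a spec), a client could compute both digests in a single pass over the file:

[code]
# Sketch only: compute the whole-file hash and a first-megabyte hash in one read.
import hashlib

def prefix_and_full_md5(path, prefix_size=1024 * 1024):
    prefix_md5 = hashlib.md5()
    full_md5 = hashlib.md5()
    seen = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            full_md5.update(chunk)
            if seen < prefix_size:
                prefix_md5.update(chunk[:prefix_size - seen])
            seen += len(chunk)
    return prefix_md5.hexdigest(), full_md5.hexdigest()

# Two differently sized copies that share the prefix digest are candidates for
# being the same file truncated at different points.
[/code]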
To implement this idea most reliably and sensibly, instead of the HUGE proposal's technique of always and only hashing the whole file, the best implementation would be to have a query 'please hash the file from range x-y'.
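Such a query could be answered with something like the following (the function name and signature are made up for illustration, and MD5 stands in for whatever hash the network settles on):

[code]
import hashlib

def hash_range(path, start, end):
    """Hypothetical handler for a 'please hash the file from range x-y' query:
    returns the MD5 of bytes start..end-1 of the file."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            chunk = f.read(min(64 * 1024, remaining))
            if not chunk:
                break
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()

# e.g. hash_range('buffy vampire.divx', 0, 20 * 2**20) lets a client holding
# the first 20M check whether an 80M copy starts with the same data.
[/code]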
Such matching shouldn't be totally automated .. because someone might have a text file which includes a smaller text file that should be considered complete .. eg they may have tacked some personal notes onto the end of some classic document. You probably don't want the extended version, so a user control button is needed: 'Find bigger files which start off the same' or not.
In fact a really good implementation (not necessary for each client to implement for it to work, as long as clients support the 'hash this part of the file please' extension) would be the one suggested below:
<Justin_> bigger or smaller
<Justin_> or have a slider hehe
<Justin_> the way md5sum works having 100 sums is not that intensive to make right?
<Justin_> cause its incremental no?
<Justin_> so if you had a slider in the program, that starts at 100%, that you can lower by 10% incremnts to find more files
<Justin_> as in, the default is files that match 100%, or files that match at the 90% mark, well not % it would have to be 10M intervals
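Justin_ is right that md5sum-style hashes are incremental: one pass over the file can emit a prefix digest at every interval by snapshotting the running state. A sketch (the 10M interval is just the figure from the chat, and the function name is mine):

[code]
import hashlib

def checkpoint_md5s(path, interval=10 * 2**20):
    """One read of the file; record a digest of the prefix roughly every `interval` bytes."""
    running = hashlib.md5()
    checkpoints = []          # (prefix_length_in_bytes, md5_of_that_prefix)
    done = 0
    next_mark = interval
    with open(path, "rb") as f:
        while True:
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            running.update(chunk)
            done += len(chunk)
            if done >= next_mark:
                # copy() snapshots the running state cheaply, so taking many
                # intermediate sums still costs only one pass over the file
                checkpoints.append((done, running.copy().hexdigest()))
                while next_mark <= done:
                    next_mark += interval
    if not checkpoints or checkpoints[-1][0] != done:
        checkpoints.append((done, running.hexdigest()))   # whole-file digest
    return checkpoints
[/code]

A slider in the client UI would then just pick how far down the checkpoint list a match has to go.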
Having the ability to request hashes for arbitrary portions of files would additionally make their use for verifying contents more reliable - if someone could generate two files with the same hash (or when this happens randomly), simply checking the hash of a given subportion would detect the difference.
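For instance (purely illustrative, with made-up file names and an arbitrary probe window), a client holding two copies that report the same whole-file hash could spot-check one sub-range:

[code]
import hashlib

def md5_of_range(path, start, length):
    with open(path, "rb") as f:
        f.seek(start)
        return hashlib.md5(f.read(length)).hexdigest()

# If both copies claim the same whole-file hash but this probe differs,
# at least one of them is not the file it claims to be.
offset, length = 37 * 2**20, 64 * 1024
same = md5_of_range("copy_a.divx", offset, length) == \
       md5_of_range("copy_b.divx", offset, length)
[/code]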
Nos
----------
Quote:
Originally posted by Smilin' Joe Fission
However, if you do the math, a 1GB file has 8589934592 bits. the number Nosferatu came up with is a total of all permutations of that 1GB file where 1 OR MORE of those bits has changed. When you change even 1 bit, the resulting file is a completely new file because its hash value will be different.
Well, this is the question. Is the HASH indeed large enough to <i>have</i> a unique value for each individual permutation of a 1G file, and if not, does it really matter?
Certainly we are not going to generate each version of a 1G file that is possible .. ever (well, unless some pr!ck sits down in the far future and does it on purpose as a programming exercise using some newfangled superdupercomputer we can't even imagine yet .. but I stray from the topic). We do need a hash that has enough values that <i>most probably</I> each individual file we generate will have a unique value .. but it can't be known for sure unless you actually generate the hash for each file (ie generate each file).
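To put rough numbers on that (assuming a 128-bit hash such as MD5; the birthday-bound estimate is standard back-of-the-envelope stuff, not something from the HUGE proposal):

[code]
# Rough numbers behind the question above, assuming a 128-bit hash (e.g. MD5).
bits_in_1gb_file = 8 * 2**30                             # 8589934592 bits, as quoted
print("distinct 1G files:    2**%d" % bits_in_1gb_file)  # 2**8589934592 possible files
print("distinct hash values: 2**128")                    # vastly fewer, so collisions must exist

# By the birthday bound you would expect to hash on the order of 2**64
# (about 1.8e19) randomly chosen files before two of them happen to share
# a value, which is far more files than any network will ever carry.
print("files before an accidental collision becomes likely: ~%.1e" % 2.0**64)
[/code]

So no, the hash can't be unique for every permutation, but for the files anyone will actually generate it doesn't really matter.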
Hashes are funny things. (I'm still searching for a good reference to back that statement up .. but don't have time to find one right now .. see later posting.)
I think if you look at the file size and the hash, you have enough certainty to call it a definite match when searching for alternate download sources. A better technique is described in the first portion of this post.
Quote:
I believe the number Nosferatu came up with may be pretty close.
As for the number of atoms in the universe.... I don't think that number is even close. Whatever scientist came up with that number is on drugs.
I did a quick calculation on my calculator based on the figure for 'mass of the observable universe' from O'Hanian's 'Physics' textbook .. and 1e70 would seem to be what "they" think (the scientists). But I agree about the drugs.
<A HREF="http://groups.google.com/groups?q=number+atoms+universe&hl=en&scoring=r&selm=4kc1fu%24gej%40agate.berkeley.edu&rnum=1">This</A> will do as a reference - at least the guy has the word 'physics' in his email, as well as the word 'berkeley'. I couldn't be bothered checking any more thoroughly than that.
Nos
<I>[Edited 14-04-2000 to add URL reference for atom count left out of initial post]</I>