Discussion:
Q: Pointers to results about statistical characteristics of hash functions?
mockturtle
2010-02-15 06:26:04 UTC
Dear all,
I am searching for results about statistical characteristics of hash
functions (I will be more precise in a moment) and I was wondering if
you have some reference to suggest. Even few googling keywords would
be fine; I tried with some obvious choices, but the resulting signal/
noise ratio is very low (e.g., the first page of results of "MD5
statistical" has links to statistical data "signed" with MD5). Note
that since I would not use the hash function against a "malicious"
adversary, but just as a double check of the correctness of a result
(let me skip all the details here, it would be too long), even "old"
and "fragile" functions could be fine, as long as some results are
known.

Let me state more precisely what I mean with "statistical
characteristics". Some questions I am interested to find an answer
are

1) Is the hash function surjective? If it is not, which fraction of
the codomine is covered? [I expect a good hash function to be
surjective or "almost" surjective, but a quantitative result would be
useful]

2) Suppose the message length fixed. How the anti-image of a hash
value is "distributed" over the domain? I expect it to be "spread
out" uniformly all over the domain; for example I expect that if S is
a subset of the domain with |S| elements and h = H(x) is the b-bit
hash of x, then the intersection of S with H^-1(h) has
"approximately" (in some sense to be specified) |S|/2^b elements.

3) What is the minimum Hamming distance between two elements of
H^-1(h)?

I agree that these questions could look a bit vague, but any result
along these lines could be of interest to me.

Thank you for any help
Peter Pearson
2010-02-17 05:49:08 UTC
On 15 Feb 2010 06:26:04 GMT, mockturtle <***@gmail.com> wrote:
[snip]
Post by mockturtle
Let me state more precisely what I mean with "statistical
characteristics". Some questions I am interested to find an answer
are
1) Is the hash function surjective? If it is not, which fraction of
the codomine is covered? [I expect a good hash function to be
surjective or "almost" surjective, but a quantitative result would be
useful]
2) Suppose the message length fixed. How the anti-image of a hash
value is "distributed" over the domain? I expect it to be "spread
out" uniformly all over the domain; for example I expect that if S is
a subset of the domain with |S| elements and h = H(x) is the b-bit
hash of x, then the intersection of S with H^-1(h) has
"approximately" (in some sense to be specified) |S|/2^b elements.
3) What is the minimum Hamming distance between two elements of
H^-1(h)?
I agree that these questions could look a bit vague, but any result
along these lines could be of interest to me.
Thank you for any help
If you're talking about cryptographic hash functions, such as MD5
or the SHA family, the best basis from which to address these
questions is the fact that these hash functions aspire to mimick
a "random oracle", and (as a practical matter) come very close to
succeeding. When you ask a random oracle to produce a 160-bit hash
of some particular input, she looks in her Catalog of Previously
Hashed Inputs. If your input value is there, she returns whatever
she returned the last time that input value occurred; otherwise,
she flips a coin 160 times to construct a new value, returns that
value, and enters it into her Catalog of Previously Hashed Inputs.

The fact that some of these hash functions have been broken
has essentially nothing to do with any general statistical
properties. Naturally, it means that you *can* construct
statistical tests that can distinguish (say) MD5 from a random
oracle, but much intelligence and effort is required to construct
such a test; you wouldn't stumble upon one by accident. (OK, if
somebody's going to get picky, we'll have to explicity exclude
tests like "compare the oracle's output with MD5"...)

questions based on the random oracle model. Addressing the
three questions you listed above would be a straightforward
exercise in statistics. If the resulting predictions fail,
you have a publishable result that would generate excitement
in the cryptology community.

--
To email me, substitute nowhere->spamcop, invalid->net.
Paul Rubin
2010-02-17 06:00:22 UTC
Post by Peter Pearson
If you're talking about cryptographic hash functions, such as MD5
or the SHA family, the best basis from which to address these
questions is the fact that these hash functions aspire to mimick
a "random oracle",
That's for fixed length inputs of course. Otherwise there is a
well-known length extension property.

Thomas Pornin
2010-02-17 05:49:31 UTC
Post by mockturtle
I am searching for results about statistical characteristics of hash
functions
A cryptographically secure hash function is expected to be
indistinguishable from a random oracle. A random oracle is modelled as a
black box with memory: for any given input message, either it was
previously invoked with the very same input message, in which case it
yields the same result than previously, or this is a new message, and
the oracle then selects the output randomly in its output domain, with
uniform probability.

If the function can be distinguished from a random oracle then the
hash function is deemed "weak" and cryptographers refuse to use it.
Corollarily, any "strong" hash function is a hash function for which
no such distinguisher is currently known. This yields some trivial
Post by mockturtle
1) Is the hash function surjective? If it is not, which fraction of
the codomine is covered?
A hash function is supposed to be surjective. But it is also expected
that surjectivity cannot be easily proven; such a proof would require
some knowledge on the hash function structure, something quite close to
a characteristic distinguishing the hash function from a random oracle.
Post by mockturtle
2) Suppose the message length fixed. How the anti-image of a hash
value is "distributed" over the domain?
Then again, distribution should be uniform. If the messages have length
n bits and the output has length b bits, then it is expected that an
average of 2^(n-b) input messages yield any given b-bit output.
Post by mockturtle
3) What is the minimum Hamming distance between two elements of
H^-1(h)?
The minimum distance is 1. The only restriction on the random oracle is
to yield two distinct outputs for two distinct inputs. Since hash
fucntions accepts inputs much longer than their output (e.g. SHA-256
accepts input strings of length 0 to 18446744073709551615 bits, while
outputting 256-bit values), then it is highly improbable that there
exists even a single hash output for which there would not be two
preimages differing by a single bit.

Then again, you would be hard pressed to prove it, and if you could
prove it for a given hash function, then that function would be looked
at with some level of defiance. Strong cryptographic hash functions
should not allow such proofs to be achievable.

--Thomas Pornin