Choosing a Good Hash Function - Computer Science at Princeton

Document technical information

Format pdf
Size 235.9 kB
First found May 22, 2018

Document content analysis

Category Also themed
Language
English
Type
not defined
Concepts
no text concepts found

Persons

Joshua Bloch
Joshua Bloch

wikipedia, lookup

Organizations

Places

Transcript

Optimize Judiciously
4.2 Hashing
"More computing sins are committed in the name of efficiency (without
necessarily achieving it) than for any other single reason - including
blind stupidity." - William A. Wulf
"We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil." - Donald E. Knuth
"We follow two rules in the matter of optimization:
Rule 1: Don't do it.
Rule 2 (for experts only). Don't do it yet - that is, not until you have
a perfectly clear and unoptimized solution."
- M. A. Jackson
Reference: Effective Java by Joshua Bloch.
Robert Sedgewick and Kevin Wayne • Copyright © 2005 • http://www.Princeton.EDU/~cos226
2
Hashing: Basic Plan.
Choosing a Good Hash Function
Save items in a key-indexed table. Index is a function of the key.
Goal: scramble the keys.
Efficiently computable.
Each table position equally likely for each key.
!
Hash function. Method for computing table index from key.
!
thoroughly researched problem
Collision resolution strategy. Algorithm and data structure to handle
two keys that hash to the same index.
Ex: Social Security numbers.
Bad: first three digits.
Better: last three digits.
!
Classic space-time tradeoff.
No space limitation: trivial hash function with key as address.
No time limitation: trivial collision resolution = sequential search.
Limitations on both time and space: hashing (the real world).
!
573 = California, 574 = Alaska
assigned in chronological order within a
given geographic region
!
Ex: date of birth.
Bad: birth year.
Better: birthday.
!
!
!
!
Ex: phone numbers.
Bad: first three digits.
Better: last three digits.
!
!
3
4
Hash Function: String Keys
hashCode
Java string library hash functions.
Hash code. For any references x and y:
Repeated calls to x.hashCode() must return the same value
provided no information used in equals comparison has changed.
If x.equals(y) then x and y must have the same hash code.
!
public int hashCode() {
int hash = 0;
for (int i = 0; i < length(); i++)
hash = (31 * hash) + s[i];
return hash;
}
ith character of s
!
!
s
= "call";
h
= s.hashCode();
hash = h % M;
!
"consistent with equals"
3045982
7121
Default implementation: memory address of x.
Customized implementations: String, URL, Integer, Date.
8191
Equivalent to h = 31L-1s0 + . . . + 312sL-3 + 31sL-2 + sL-1.
Horner's method to hash string of length L: O(L).
Q. Can we reliably use (h % M) as index for table of size M?
A. No. Instead, use (h & 0x7fffffff) % M.
5
6
Implementing HashCode: US Phone Numbers
Collisions
Phone numbers: (609) 867-5309.
area code
exchange
Collision = two keys hashing to same value.
Essentially unavoidable.
Birthday problem: how many people will have to enter a room until
two have the same birthday? 23
With M hash values, expect a collision after sqrt(! ! M) insertions.
extension
!
!
!
public final class PhoneNumber {
private final int area, exch, ext;
public PhoneNumber(int area, int exch, int ext) {
this.area = area;
this.exch = exch;
this.ext = ext;
}
Conclusion: can't avoid collisions unless
you have a ridiculous amount of memory.
Challenge: efficiently cope with collisions.
public boolean equals(Object y) { // as before }
}
public int hashCode() {
return 10007 * (area + 1009 * exch) + ext;
}
7
8
Collision Resolution: Two Approaches.
Separate chaining.
M much smaller than N.
" N / M keys per table position.
Put keys that collide in a list.
Need to search lists.
st[0]
jocularly
!
st[1]
listen
!
st[2]
null
!
st[3]
suburban
st[8190]
browsing
Separate Chaining
Separate chaining: array of M linked lists.
Hash: map key to integer i between 0 and M-1.
Insert: put at front of ith chain (if not already there).
Search: only need to search ith chain.
Running time: proportional to length of chain.
seriously
!
!
untravelled
considerating
!
!
!
M = 8191, N = 15000
Open addressing.
M much larger than N.
Plenty of empty table slots.
When a new key collides, find next
empty slot and put it there.
Complex collision patterns.
!
!
!
st[0]
jocularly
st[1]
seriously
st[2]
st[3]
st[0]
jocularly
st[1]
listen
seriously
listen
st[2]
suburban
st[4]
null
st[3]
null
suburban
untravelled
considerating
!
st[5]
st[30001]
null
st[8190]
browsing
M = 30001, N = 15000
hash
7121
me
3480
ishmael
5017
seriously
0
untravelled
3
suburban
3
. . .
.
M = 8191
9
Symbol Table: Separate Chaining
10
Symbol Table: Separate Chaining Implementation (cont)
public class ListHashST<Key, Value> {
private int M = 8191;
private Node[] st = new Node[M];
private static class Node {
Object key;
no generic array creation in Java
Object val;
Node
next;
Node(Object key, Object val, Node next)
this.key = key;
this.val = val;
this.next = next;
}
}
browsing
key
call
public void put(Key key, Value val) {
int i = hash(key);
for (Node x = st[i]; x != null; x = x.next) {
if (key.equals(x.key)) {
x.val = val;
return;
check if key already present
}
insert at front of chain
}
st[i] = new Node(k, val, st[i]);
}
{
public Value get(Key key) {
int i = hash(k);
for (Node x = st[i]; x != null; x = x.next)
if (key.equals(x.key))
return (Value) x.val;
return null;
}
private int hash(Key key) {
return (key.hashCode() & 0x7fffffff) % M;
hex
}
between 0 and M-1
11
12
Separate Chaining Performance
Symbol Table: Implementations Cost Summary
Separate chaining performance.
Search cost is proportional to length of chain.
Trivial: average length = N / M.
Worst case: all keys hash to same chain.
!
Worst Case
!
!
Theorem. Let # = N / M > 1 be average length of list. For any t > 1,
probability that list length > t # is exponentially small in t.
Average Case
Implementation
Search
Insert
Delete
Search
Insert
Delete
Sorted array
log N
N
N
log N
N/2
N/2
Unsorted list
N
N
N
N/2
N
N/2
Separate chaining
N
N
N
1*
1*
1*
depends on hash map being random map
* assumes hash function is random
Parameters.
M too large $ too many empty chains.
M too small $ chains too long.
Typical choice: # = N / M " 10 $ constant-time search/insert.
!
!
Advantages: fast insertion, fast search.
Disadvantage: hash table has fixed size.
!
fix: use repeated doubling, and rehash all keys
13
14
Linear Probing
Symbol Table: Linear Probing Implementation
typically twice as many slots as elements
Linear probing: array of size M.
Hash: map key to integer i between 0 and M-1.
Insert: put in slot i if free, if not try i+1, i+2, etc.
Search: search slot i, if occupied but no match, try i+1, i+2, etc.
public class ArrayHashST<Key, Val> {
private int M = 30001;
private Key[] keys = (Key[]) new Object[M];
private Val[] vals = (Val[]) new Object[M];
!
!
!
no generic array
creation in Java
private int hash(Key key) { // as before }
Cluster.
Contiguous block of items.
Search through cluster using elementary algorithm for arrays.
public void put(Key key, Val val) {
int i;
for (i = hash(key); keys[i] != null; i = (i+1) % M)
if (keys[i].equals(key)) break;
keys[i] = key;
vals[i] = val;
}
!
!
public Val get(Key key) {
int i;
for (i = hash(key); keys[i] != null; i = (i+1) % M)
if (keys[i].equals(key)) break;
return vals[i];
}
}
ASEARCHINGXMP
15
16
Linear Probing Performance
Symbol Table: Implementations Cost Summary
Linear probing performance.
Insert and search cost depend on length of cluster.
but elements more likely to
Trivial: average length of cluster = # = N / M.
hash to big clusters
Worst case: all keys hash to same cluster.
Worst Case
!
Average Case
!
!
Theorem. [Knuth 1962] Let # = N / M < 1 be average length of list.
Implementation
Search
Insert
Delete
Search
Insert
Delete
Sorted array
log N
N
N
log N
N/2
N/2
Unsorted list
N
N
N
N/2
N
N/2
Separate chaining
N
N
N
1*
1*
1*
Linear probing
N
N
N
1*
1*
1*
depends on hash map being
random map
* assumes hash function is random
Parameters.
M too large $ too many empty array entries.
M too small $ clusters coalesce.
Typical choice: M " 2N $ constant-time search/insert.
Advantages: fast insertion, fast search.
Disadvantage: hash table has fixed size.
!
!
fix: use repeated doubling, and rehash all keys
!
17
18
Double Hashing
Double Hashing Performance
Double hashing. Avoid clustering by using second hash to compute skip
for search.
Linear probing performance.
Insert and search cost depend on length of cluster.
Trivial: average length of cluster = # = N / M.
Worst case: all keys hash to same cluster.
!
!
Hash. Map key to integer i between 0 and M-1.
Second hash. Map key to nonzero skip value k.
!
Theorem. [Guibas-Szemeredi] Let # = N / M < 1 be average length of list.
Ex: k = 1 + (v mod 97).
depends on hash map being
random map
hashCode
Result. Skip values give different search paths for keys that collide.
Parameters.
M too large $ too many empty array entries.
M too small $ clusters coalesce.
Typical choice: M " 2N $ constant-time search/insert.
Best practices. Make k and M relatively prime.
Disadvantage: delete cumbersome to implement.
!
!
!
19
20
Hashing Tradeoffs
Hash Table: Java Library
Separate chaining vs. linear probing/double hashing.
Space for links vs. empty table slots.
Small table + linked allocation vs. big coherent array.
Java has built-in libraries for symbol tables.
HashMap = linear probing hash table implementation.
!
!
!
import java.util.HashMap;
public class HashMapDemo {
public static void main(String[] args) {
HashMap<String, String> st = new HashMap <String, String>();
st.put("www.cs.princeton.edu", "128.112.136.11");
st.put("www.princeton.edu",
"128.112.128.15");
System.out.println(st.get("www.cs.princeton.edu"));
}
}
Linear probing vs. double hashing.
load factor #
50%
66%
75%
90%
linear
probing
search
1.5
2.0
3.0
5.5
insert
2.5
5.0
8.5
55.5
double
hashing
search
1.4
1.6
1.8
2.6
insert
1.5
2.0
3.0
5.5
Duplicate policy.
Java HashMap allows null values.
Our implementations forbid null values.
!
!
21
22
Symbol Table: Using HashMap
Designing a Good Hash Function
Symbol table. Implement our interface using HashMap.
Java 1.1 string library hash function.
For long strings: only examines 8 evenly spaced characters.
Saves time in performing arithmetic.
Great potential for bad collision patterns.
!
!
!
import java.util.HashMap;
import java.util.Iterator;
public int hashCode() {
int hash = 0;
if (length() < 16) {
for (int i = 0; i < length(); i++)
hash = (37 * hash) + s[i];
}
else {
int skip = length() / 8;
for (int i = 0; i < length(); i += skip)
hash = (37 * hash) + s[i];
}
return hash;
String.java
}
public class ST<Key, Value> implements Iterable<Key> {
private HashMap<Key, Value> st = new HashMap<Key, Value>();
public void put(Key key, Value val) {
if (val == null) st.remove(key);
else
st.put(key, val);
}
public Value get(Key key)
{ return
public Value remove(Key key)
{ return
public boolean contains(Key key) { return
public int size() contains(Key ke{ return
public Iterator<Key> iterator() { return
st.get(key);
st.remove(key);
st.containsKey(key);
st.size();
st.keySet().iterator();
}
}
}
}
}
}
23
24
Algorithmic Complexity Attacks
Algorithmic Complexity Attacks
Is the random hash map assumption important in practice?
Obvious situations: aircraft control, nuclear reactors.
Surprising situations: denial-of-service attacks.
Q. How easy is it to break Java's hashCode with String keys?
A. Almost trivial: string hashCode is part of Java 1.5 API.
Ex: hashCode of "BB" equals hashCode of "Aa".
Can now create 2N strings of length 2N that all hash to same value!
!
!
!
!
malicious adversary learns your ad hoc hash function
(e.g., by reading Java API) and causes a big pile-up in
single address that grinds performance to a halt
AaAaAaAa
AaAaAaBB
AaAaBBAa
AaAaBBBB
AaBBAaAa
AaBBAaBB
AaBBBBAa
AaBBBBBB
Real-world exploits. [Crosby-Wallach 2003]
Bro server: send carefully chosen packets to DOS the server, using
less bandwidth than a dial-up modem
Perl 5.8.0: insert carefully chosen strings into associative array.
Linux 2.4.20 kernel: save files with carefully chosen names.
!
Possible to fix?
Security by obscurity. [not recommended]
Cryptographically secure hash functions.
Universal hashing.
!
!
Reference:
BBAaAaAa
BBAaAaBB
BBAaBBAa
BBAaBBBB
BBBBAaAa
BBBBAaBB
BBBBBBAa
BBBBBBBB
!
!
http://www.cs.rice.edu/~scrosby/hash
!
25
26
×

Report this document