Locality-Sensitive Hashing (LSH) is a probabilistic hashing algorithm designed to map similar high-dimensional data points to the same hash bucket with high probability, enabling fast approximate nearest neighbor (ANN) search. It works by applying a family of hash functions that are "sensitive" to the distance metric of interest (e.g., cosine similarity, Euclidean distance).
Core Mechanism:
- Hash Function Design: LSH functions are constructed so that the probability of collision (hashing to the same value) is high for nearby points and low for distant points. For cosine similarity, a common LSH family uses random hyperplanes: a hash bit is determined by which side of a random hyperplane a vector falls on (
sign(dot(r, x))).
- Amplification: A single hash function is too noisy. To increase precision, multiple functions are combined into a single "hash signature." To increase recall (finding more true neighbors), multiple independent hash tables are constructed. A query is hashed with all functions, and candidates are retrieved from the matching buckets across all tables.
- Candidate Verification: Retrieved candidates from the hash buckets are then ranked using the exact distance metric (e.g., cosine similarity) to return the final, most similar results.
The process trades perfect accuracy for orders-of-magnitude speed and memory efficiency compared to exhaustive search, making it fundamental for searching large-scale embedding databases.