An inverted file index (or inverted index) is a database index that maps content, such as words or tokens, to the documents or data points where that content appears, enabling fast full-text search. It works by inverting the traditional document-term relationship: instead of listing the words in a document, it lists the documents containing each word.
How it works:
- Document Processing: Each document is parsed, tokenized (split into words/terms), and often normalized (lowercasing, stemming).
- Index Construction: For each unique term, the index creates a postings list. This list records the document IDs where the term appears and often includes positional information.
- Query Processing: When a user submits a query (e.g., "machine learning"), the index retrieves the postings lists for "machine" and "learning." It then performs set operations (like intersection for an AND query) to find documents containing all query terms.
This structure allows search engines to skip scanning every document, answering queries in time proportional to the number of matching documents rather than the total corpus size.