Reranking is a two-stage information retrieval process designed to improve final result precision by combining the speed of an approximate search with the accuracy of a more powerful model. It works by first using a fast, scalable retriever (such as a bi-encoder or keyword search) to fetch a broad set of candidate documents. This initial set is then passed to a slower, more computationally expensive reranker (typically a cross-encoder), which jointly processes the query and each candidate to produce a precise relevance score. The candidates are then re-ordered by these new scores before being returned to the user or passed to a downstream system such as a Large Language Model (LLM).
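The two stages can be sketched in miniature. Everything below is a hypothetical stand-in: the corpus is invented, the first stage mimics keyword search with simple token overlap, and `toy_cross_encoder` simulates the joint query-document scoring a real cross-encoder model would perform.

```python
def tokens(text: str) -> set[str]:
    """Lowercase, punctuation-stripped token set."""
    return {w.strip(".,?!").lower() for w in text.split()}

# Hypothetical mini corpus; a real system would index far more documents.
CORPUS = [
    "How do I improve my search for cheap flight results?",   # keyword-stuffed distractor
    "Rerankers refine search ordering with cross-encoders.",  # truly relevant
    "Bi-encoders embed queries and documents independently.",
    "Bananas are a good source of potassium.",
]

def first_stage_retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Stage 1: fast keyword-overlap retrieval (stand-in for BM25 or a bi-encoder)."""
    q = tokens(query)
    scored = sorted(corpus, key=lambda d: len(q & tokens(d)), reverse=True)
    return [d for d in scored[:k] if q & tokens(d)]

def toy_cross_encoder(query: str, doc: str) -> float:
    """Stage-2 stand-in: scores the (query, document) pair jointly. The bonus
    term simulates the semantic signal a real cross-encoder would contribute."""
    score = float(len(tokens(query) & tokens(doc)))
    if {"rerankers", "cross-encoders"} & tokens(doc):
        score += 4.0
    return score

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Stage 2: re-order the small candidate set by the expensive pairwise score."""
    return sorted(candidates, key=lambda d: toy_cross_encoder(query, d), reverse=True)

query = "how do rerankers improve search results"
candidates = first_stage_retrieve(query, CORPUS)
print(candidates[0])  # the keyword-stuffed distractor ranks first after stage 1
final = rerank(query, candidates)
print(final[0])       # the reranker promotes the genuinely relevant document
```

Note the division of labor: the cheap first stage scores every document in the corpus, while the expensive pairwise scorer only ever sees the handful of candidates that survive it.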
This architecture is fundamental to Retrieval-Augmented Generation (RAG) systems, where high-quality context is paramount. The first stage ensures low latency, while the second stage acts as a quality filter, dramatically improving the signal-to-noise ratio of the retrieved information.