The standard RAG loop retrieves top-k chunks by embedding similarity and feeds them straight to the generator. That works until it doesn’t — and the failure mode is subtle: you get chunks that are semantically adjacent but contextually wrong.
Cross-encoder re-ranking fixes this. Instead of comparing embeddings, you run each candidate chunk against the query through a small classifier that scores relevance directly. The top-3 survivors are almost always the right ones.
Under 100 lines of Python, compatible with any retriever, and the latency hit is smaller than you’d expect. This dispatch has the full implementation, the latency profiling, and why it’s almost always worth the inference cost.
to Becoming an Applied AI Engineer