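TextTiling is an unsupervised algorithm, introduced by Marti Hearst, for segmenting a long document into coherent multi-paragraph topical units by analyzing patterns of lexical cohesion. The text is first divided into short fixed-length token sequences (pseudo-sentences). At each gap between them, the algorithm compares the block of sequences to the left with the block to the right, typically by the cosine similarity of their term-frequency vectors. Plotted across the document, these scores form a similarity curve, and each valley in the curve is assigned a depth score measuring how far similarity drops relative to the peaks on either side. Sufficiently deep valleys, where the vocabulary shifts markedly, are taken as topic boundaries and are usually aligned to the nearest paragraph break. The same idea underlies many semantic chunking strategies in information retrieval systems.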
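
The block comparison and boundary selection can be sketched in a few lines of Python. This is a simplified illustration, not a faithful reimplementation: it assumes whole sentences as the comparison unit instead of fixed-length token sequences, skips stopword removal and smoothing, and uses a cutoff of the mean depth minus half a standard deviation, which is one common choice. The function names (`gap_similarities`, `depth_scores`, `find_boundaries`) and the `block_size` parameter are hypothetical, chosen for this example.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gap_similarities(sentences, block_size=2):
    """Similarity of the blocks left and right of every sentence gap."""
    bags = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - block_size):gap], Counter())
        right = sum(bags[gap:gap + block_size], Counter())
        sims.append(cosine(left, right))
    return sims


def depth_scores(sims):
    """Depth of each gap: drop from the nearest peak on each side."""
    depths = []
    for i, s in enumerate(sims):
        left_peak = s
        for j in range(i - 1, -1, -1):  # climb left while the curve rises
            if sims[j] < left_peak:
                break
            left_peak = sims[j]
        right_peak = s
        for j in range(i + 1, len(sims)):  # climb right while the curve rises
            if sims[j] < right_peak:
                break
            right_peak = sims[j]
        depths.append((left_peak - s) + (right_peak - s))
    return depths


def find_boundaries(sentences, block_size=2):
    """Return indices of sentences that start a new segment."""
    sims = gap_similarities(sentences, block_size)
    if not sims:
        return []
    depths = depth_scores(sims)
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    cutoff = mean - std / 2  # assumed cutoff policy; tune for real data
    boundaries = []
    for i, s in enumerate(sims):
        left = sims[i - 1] if i > 0 else math.inf
        right = sims[i + 1] if i + 1 < len(sims) else math.inf
        is_valley = left >= s and right >= s and depths[i] > 0
        if is_valley and depths[i] > cutoff:
            boundaries.append(i + 1)  # gap i sits before sentence i + 1
    return boundaries


demo = ["cat mouse"] * 3 + ["stock bond"] * 3
print(find_boundaries(demo))  # -> [3]
```

In the toy input the vocabulary changes completely after the third sentence, so the similarity curve bottoms out at that gap and the only reported boundary is sentence index 3. On real text the curve is much noisier, which is why practical implementations smooth it and normalize the vocabulary before computing depth scores.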
