A token limit is the maximum number of tokens—the basic units of text, such as words or subwords—that a language model can process as combined input and output within a single inference call, as defined by its fixed context window. This hard constraint acts as the model's working memory, forcing engineering trade-offs between retaining conversation history, incorporating new instructions, and generating lengthy responses. Exceeding the limit typically results in an error or in automatic context truncation, where tokens are discarded from the sequence.
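The truncation trade-off described above can be sketched in a few lines. This is a minimal illustration, not any particular model's behavior: the names `CONTEXT_WINDOW`, `MAX_OUTPUT`, `count_tokens`, and `truncate_history` are hypothetical, and real systems use subword tokenizers (e.g. BPE) rather than the whitespace stand-in used here. The sketch reserves part of the window for the response and drops the oldest messages until the rest of the history fits.

```python
# Minimal sketch of context-window budgeting. All names and numbers are
# illustrative assumptions, not a real model's API or limits.

CONTEXT_WINDOW = 8  # total token budget (input + output), tiny for demo purposes
MAX_OUTPUT = 3      # tokens reserved for the model's generated response

def count_tokens(text: str) -> int:
    # Stand-in tokenizer: one token per whitespace-separated word.
    # Real tokenizers split text into subword units instead.
    return len(text.split())

def truncate_history(messages: list[str]) -> list[str]:
    """Drop the oldest messages until the remainder fits the input
    budget (context window minus the reserved output tokens)."""
    input_budget = CONTEXT_WINDOW - MAX_OUTPUT
    kept: list[str] = []
    used = 0
    # Walk newest-to-oldest so the most recent turns survive truncation.
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > input_budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["hello there", "how are you today", "fine thanks"]
print(truncate_history(history))  # → ['fine thanks']
```

With an input budget of 5 tokens, only the most recent message ("fine thanks", 2 tokens) fits once the 4-token middle message would overflow it; the older turns are silently discarded, which is exactly the information loss that context truncation imposes.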
