Cost Per Thousand Tokens

Cost Per Thousand Tokens (CPT) is the standardized pricing metric used by AI providers to charge for language model inference, based on the volume of input and output tokens processed.
AGENT PERFORMANCE METRIC

What is Cost Per Thousand Tokens?

Cost Per Thousand Tokens (CPT) is the fundamental unit of pricing for generative AI and large language model APIs, directly linking computational expense to the volume of text processed.

Cost Per Thousand Tokens (CPT or CPMT) is a standardized pricing metric used by AI providers to charge for language model inference, based on the number of input and output tokens processed. A token is a sub-word unit of text, roughly equivalent to 0.75 words in English. Providers typically publish separate rates for input (prompt) tokens and output (completion) tokens, with output often being more expensive due to the autoregressive generation process. With rates quoted per 1,000 tokens, engineers can forecast API expenses as (input_tokens / 1000 * input_rate) + (output_tokens / 1000 * output_rate). It is a primary variable in the Total Cost of Ownership (TCO) for agentic systems.
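The forecast formula can be sketched as a small Python helper. This is a minimal illustration, not any provider's billing logic: the function name `request_cost` and the rates used in the example are hypothetical, and rates are assumed to be quoted in dollars per 1,000 tokens.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate_per_1k: float, output_rate_per_1k: float) -> float:
    """Return the dollar cost of one request, given per-1K-token rates."""
    input_cost = (input_tokens / 1000) * input_rate_per_1k
    output_cost = (output_tokens / 1000) * output_rate_per_1k
    return input_cost + output_cost


# Hypothetical rates, chosen only to make the arithmetic easy to follow.
cost = request_cost(input_tokens=500, output_tokens=200,
                    input_rate_per_1k=0.50, output_rate_per_1k=1.50)
print(f"${cost:.2f}")  # prints $0.55
```

Keeping the two rates as separate parameters matters because input and output are billed differently; a single blended rate would misprice workloads whose prompt-to-completion ratio shifts.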

For agent performance benchmarking, CPT is a critical cost telemetry signal. Observability platforms aggregate token consumption across tool calls, planning cycles, and user sessions to attribute expense. This enables FinOps analysis that compares the cost-effectiveness of different models or prompt architectures for a given task. High token usage may indicate inefficient context management or excessive reasoning steps. Monitoring token cost alongside latency and task success rate gives a fuller view of agent efficiency and guides optimization efforts such as prompt compression or caching strategies that reduce operational expenditure.
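One way to combine cost with latency and success rate is a "cost per successful task" figure. The sketch below is an assumption about how such a summary might be computed; the `TaskRun` record, the `summarize` function, and the rates are hypothetical and not tied to any particular observability platform.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    input_tokens: int
    output_tokens: int
    latency_s: float
    succeeded: bool


def summarize(runs, input_rate_per_1k, output_rate_per_1k):
    """Aggregate cost, success rate, and latency over a list of TaskRun records."""
    total_cost = sum(
        r.input_tokens / 1000 * input_rate_per_1k
        + r.output_tokens / 1000 * output_rate_per_1k
        for r in runs
    )
    successes = sum(1 for r in runs if r.succeeded)
    n = len(runs)
    return {
        "total_cost": total_cost,
        "success_rate": successes / n if n else 0.0,
        "mean_latency_s": sum(r.latency_s for r in runs) / n if n else 0.0,
        # Failed runs still cost money, so they are charged to the successes.
        "cost_per_success": total_cost / successes if successes else float("inf"),
    }


# Hypothetical runs and rates for illustration.
runs = [
    TaskRun(2000, 500, 1.2, True),
    TaskRun(4000, 1000, 2.0, False),
    TaskRun(2000, 500, 1.6, True),
]
summary = summarize(runs, input_rate_per_1k=0.01, output_rate_per_1k=0.03)
print(summary)
```

Dividing total spend by successful tasks, rather than by all tasks, penalizes a cheap model that fails often, which is usually the comparison that matters when choosing between models.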

COST PER THOUSAND TOKENS

Key Components of Token Cost

Cost Per Thousand Tokens (CPT) is the fundamental unit for pricing large language model inference. Understanding its components is critical for forecasting, budgeting, and optimizing AI agent deployments.

01

Input vs. Output Pricing

Providers almost always charge separately for input tokens (prompt) and output tokens (completion). Output tokens are typically priced at a multiple of input tokens (the comparison on this page ranges from about 1.3x to 5x), because autoregressive generation produces one token at a time, whereas the prompt can be processed in parallel. For example, a query with a 1,000-token prompt generating a 500-token response incurs separate costs for each segment. This structure rewards prompt engineering that reduces context length and generation parameters that cap the maximum number of output tokens.

02

Context Window Consumption

The entire submitted context, including system instructions, few-shot examples, conversation history, and retrieved documents, counts as input tokens. Long contexts with Retrieval-Augmented Generation (RAG) or complex agentic memory dramatically increase cost. For a model with a 128k token context, a request that fills the window consumes 128 times the input tokens of a 1k-token request, even if the final output is short. Efficient context management and semantic chunking are essential cost controls.
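The effect of conversation history on input tokens can be sketched as follows. This assumes a simple chat loop that resends the system prompt and the full history on every turn, with no truncation or caching; the function name and token counts are hypothetical.

```python
def billed_input_tokens(system_tokens, turns):
    """Return the input tokens billed on each turn.

    `turns` is a list of (user_tokens, assistant_tokens) pairs. Each request
    is assumed to carry the system prompt, all earlier messages, and the new
    user message.
    """
    per_turn = []
    history = 0
    for user_tokens, assistant_tokens in turns:
        per_turn.append(system_tokens + history + user_tokens)
        history += user_tokens + assistant_tokens
    return per_turn


# Three identical turns: 100 user tokens in, 300 assistant tokens out.
per_turn = billed_input_tokens(system_tokens=200, turns=[(100, 300)] * 3)
print(per_turn)       # [300, 700, 1100]
print(sum(per_turn))  # 2100
```

Only 300 new user tokens were typed across the three turns, yet 2,100 input tokens are billed, because earlier messages are paid for again on every request.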

03

Model Tier & Capability

Cost scales directly with model size and capability. Pricing tiers are structured as:

  • Economy/High-Latency: Cheapest, for batch processing.
  • Standard/General Purpose: Balanced cost and speed for most agent tasks.
  • Premium/Low-Latency: Highest cost, optimized for real-time user-facing agents.

Larger, more capable models (e.g., GPT-4, Claude 3 Opus) command a significant premium over smaller, faster models (e.g., GPT-3.5-Turbo, Claude 3 Haiku). The choice directly impacts both Cost Per Thousand Tokens and end-to-end latency.

04

Caching & Optimization Techniques

Advanced serving techniques can reduce effective token cost:

  • Prompt Caching: Identical prompt prefixes across requests are computed once.
  • Continuous Batching: Groups multiple requests to maximize GPU utilization, amortizing overhead.
  • Speculative Decoding: Uses a small, fast model to draft tokens verified by a larger model, reducing the large model's workload.

While these are often managed by the provider, they explain pricing differences between optimized and standard endpoints. On-premise deployments use these to lower Total Cost of Ownership (TCO).

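As a rough illustration of how prompt caching changes effective input cost, the sketch below assumes that a shared prefix is billed at full price on the first request and at a discounted rate afterwards. The `cached_multiplier` of 0.5 is a hypothetical discount, and the model ignores cache expiry and any cache-write surcharge a provider may apply.

```python
def input_cost_with_cache(prefix_tokens, suffix_tokens, requests,
                          input_rate_per_1k, cached_multiplier):
    """Total input cost for `requests` calls that share the same prompt prefix.

    cached_multiplier is the fraction of the normal input rate charged for
    cached prefix tokens (1.0 means no caching discount).
    """
    if requests <= 0:
        return 0.0
    # First request is a cache miss: the whole prompt is billed at full rate.
    first = (prefix_tokens + suffix_tokens) / 1000 * input_rate_per_1k
    # Later requests pay the discounted rate on the prefix only.
    later = ((prefix_tokens * cached_multiplier + suffix_tokens)
             / 1000 * input_rate_per_1k)
    return first + later * (requests - 1)


# Hypothetical workload: 4,000-token shared prefix, 200-token unique suffix.
without_cache = input_cost_with_cache(4000, 200, 100, 0.01, cached_multiplier=1.0)
with_cache = input_cost_with_cache(4000, 200, 100, 0.01, cached_multiplier=0.5)
print(f"${without_cache:.2f} vs ${with_cache:.2f}")  # prints $4.20 vs $2.22
```

The saving grows with the share of the prompt that is a stable prefix, which is why placing system instructions and tool definitions before per-request content tends to help.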
05

Embedding & Vision Model Costs

Tokenization and pricing differ for non-text modalities:

  • Text Embeddings: Priced per input token, used for vector database indexing and retrieval in RAG flows.
  • Vision/Large Multimodal Models (LMMs): Input is often tokenized differently. High-resolution images can be represented as hundreds or thousands of tokens (e.g., a 1024x1024 image can consume on the order of 1,000 tokens, depending on the provider's image tokenization scheme).

This makes multi-modal agent interactions, like analyzing diagrams or screenshots, significantly more expensive than pure text.

05

Tool Calling & Function Execution

When an agent uses tool calling (e.g., OpenAI's function calling, Model Context Protocol), the structured function definitions in the prompt are counted as input tokens. The model's output containing the function call arguments is counted as output tokens. Complex agents with extensive toolkits incur a persistent overhead in every interaction, as the tool schemas must remain in context. This makes agentic telemetry for cost attribution to specific tool calls essential.
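The persistent overhead of tool schemas can be estimated with a back-of-the-envelope sketch. Everything here is an assumption for illustration: the two tool definitions are hypothetical, the 4-characters-per-token heuristic is only a rough approximation, and providers may serialize tool definitions differently before tokenizing them, so a real tokenizer should be used for billing-grade numbers.

```python
import json

CHARS_PER_TOKEN = 4  # rough heuristic (assumption); real tokenizers differ


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def tool_overhead_tokens(tool_schemas: list, model_calls: int) -> int:
    """Estimated input tokens spent resending tool schemas on every call."""
    per_call = estimate_tokens(json.dumps(tool_schemas))
    return per_call * model_calls


# Hypothetical tool definitions in a JSON-schema style.
tools = [
    {"name": "get_weather",
     "description": "Look up the current weather for a city.",
     "parameters": {"type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"]}},
    {"name": "search_docs",
     "description": "Search internal documentation by keyword.",
     "parameters": {"type": "object",
                    "properties": {"query": {"type": "string"},
                                   "limit": {"type": "integer"}},
                    "required": ["query"]}},
]

per_call = tool_overhead_tokens(tools, model_calls=1)
per_session = tool_overhead_tokens(tools, model_calls=10)
print(per_call, per_session)
```

Because the overhead scales with the number of model calls rather than with the number of tools actually invoked, trimming the toolkit exposed to each agent step is one way to reduce it.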

INPUT & OUTPUT TOKEN PRICING

CPT Pricing Models Across Major Providers

A comparison of Cost Per Thousand Tokens (CPT) pricing for input (prompt) and output (completion) tokens across leading cloud AI inference platforms, as of Q2 2024. Prices are in US dollars per 1,000 tokens for standard on-demand inference and exclude committed-use discounts or fine-tuned model variants.

| Model / Tier | OpenAI (GPT-4) | Anthropic (Claude 3 Opus) | Google (Gemini 1.5 Pro) | Meta (Llama 3 70B via Groq) | Mistral AI (Mistral Large) |
| --- | --- | --- | --- | --- | --- |
| Input Token Price (per 1K) | $0.03 | $0.075 | $0.000125 | $0.00059 | $0.002 |
| Output Token Price (per 1K) | $0.06 | $0.375 | $0.000375 | $0.00079 | $0.006 |
| Output/Input Price Ratio | 2.0x | 5.0x | 3.0x | 1.34x | 3.0x |
| Context Window (Tokens) | 128K | 200K | 1M | 8K (typical) | 32K |
| Typical P50 Latency | 300-500ms | 2-4s | 1-3s | < 100ms | 200-400ms |
| Typical P99 Latency | 1-2s | 8-12s | 5-8s | < 200ms | 1-2s |
| Minimum Charge per Request | | | | | |
| Batch Inference Discount | | | | | |
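To see what the listed rates imply for a concrete workload, the sketch below recomputes the output/input ratios and prices a hypothetical monthly volume of 1,000,000 input tokens and 250,000 output tokens. The rates are copied from the figures listed above as stated on this page; they are a snapshot and should be checked against each provider's current price list before being used for budgeting.

```python
# (input $/1K tokens, output $/1K tokens), as listed in the comparison above.
RATES = {
    "OpenAI (GPT-4)": (0.03, 0.06),
    "Anthropic (Claude 3 Opus)": (0.075, 0.375),
    "Google (Gemini 1.5 Pro)": (0.000125, 0.000375),
    "Meta (Llama 3 70B via Groq)": (0.00059, 0.00079),
    "Mistral AI (Mistral Large)": (0.002, 0.006),
}


def workload_cost(rates, input_tokens, output_tokens):
    """Dollar cost of a workload for one (input_rate, output_rate) pair."""
    input_rate, output_rate = rates
    return input_tokens / 1000 * input_rate + output_tokens / 1000 * output_rate


# Hypothetical monthly workload.
INPUT_TOKENS, OUTPUT_TOKENS = 1_000_000, 250_000

ratios = {name: round(out / inp, 2) for name, (inp, out) in RATES.items()}
costs = {name: workload_cost(r, INPUT_TOKENS, OUTPUT_TOKENS)
         for name, r in RATES.items()}

for name in sorted(costs, key=costs.get):
    print(f"{name:30s} ratio {ratios[name]:.2f}x  cost ${costs[name]:,.2f}")
```

Price alone does not settle the choice: the context window and latency rows in the same comparison constrain which of these options can serve a given agent workload at all.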

COST PER THOUSAND TOKENS

Frequently Asked Questions

Cost Per Thousand Tokens (CPT) is the fundamental unit of pricing for generative AI and large language model APIs. These questions address how it's calculated, its impact on total cost of ownership, and strategies for optimization.

Cost Per Thousand Tokens (CPT) is a standardized pricing metric used by AI cloud providers to charge for language model inference, based on the volume of input (prompt) and output (completion) tokens processed. It is calculated by dividing the input and output token counts for a request by 1,000, multiplying each by the provider's published rate for that token type and model, and summing the two results. For example, if a model costs $0.50 per 1K input tokens and $1.50 per 1K output tokens, a request with a 500-token prompt and a 200-token response would cost (500/1000 * $0.50) + (200/1000 * $1.50) = $0.25 + $0.30 = $0.55.

Providers typically publish separate rates for input and output tokens, as generating text (output) is computationally more intensive than reading it (input). This granular billing makes Agent Cost Telemetry—tracking and attributing token usage to specific sessions, users, or features—a critical engineering practice for financial control.
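A minimal sketch of cost attribution follows, assuming each model call is logged as a usage event carrying token counts plus the session and feature that triggered it. The event fields, the model name `model-a`, and its rates are all hypothetical; a real pipeline would read these from provider usage metadata and a maintained price table.

```python
from collections import defaultdict


def attribute_cost(events, rates, key):
    """Sum dollar cost per distinct value of `key` (e.g. "session" or "feature")."""
    totals = defaultdict(float)
    for event in events:
        input_rate, output_rate = rates[event["model"]]
        cost = (event["input_tokens"] / 1000 * input_rate
                + event["output_tokens"] / 1000 * output_rate)
        totals[event[key]] += cost
    return dict(totals)


# Hypothetical per-1K rates and logged usage events.
rates = {"model-a": (0.01, 0.03)}
events = [
    {"model": "model-a", "session": "s1", "feature": "search",
     "input_tokens": 1000, "output_tokens": 200},
    {"model": "model-a", "session": "s1", "feature": "summarize",
     "input_tokens": 3000, "output_tokens": 500},
    {"model": "model-a", "session": "s2", "feature": "search",
     "input_tokens": 2000, "output_tokens": 100},
]

by_session = attribute_cost(events, rates, key="session")
by_feature = attribute_cost(events, rates, key="feature")
print(by_session)
print(by_feature)
```

Logging raw token counts rather than precomputed dollar amounts lets the same events be repriced later if rates change or a different model is being evaluated.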

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.