Cost Per Thousand Tokens (CPT or CPMT) is a standardized pricing metric used by AI providers to charge for language model inference, based on the number of input and output tokens processed. A token is a sub-word unit of text; in English, one token corresponds to roughly 0.75 words. Providers typically publish separate rates for input (prompt) tokens and output (completion) tokens, with output usually more expensive because of the autoregressive generation process. With rates quoted per 1,000 tokens, engineers can forecast API expenses as (input_tokens / 1000 * input_rate) + (output_tokens / 1000 * output_rate). For agentic systems built on hosted model APIs, it is often the largest variable component of Total Cost of Ownership (TCO).
Cost Per Thousand Tokens

What is Cost Per Thousand Tokens?
Cost Per Thousand Tokens (CPT) is the fundamental unit of pricing for generative AI and large language model APIs, directly linking computational expense to the volume of text processed.
For agent performance benchmarking, CPT is a critical cost telemetry signal. Observability platforms aggregate token consumption across tool calls, planning cycles, and user sessions to attribute expense. This enables FinOps analysis that compares the cost-effectiveness of different models or prompt architectures for a given task. High token usage may indicate inefficient context management or excessive reasoning steps. Monitoring CPT alongside latency and task success rate gives a fuller view of agent efficiency and guides optimizations such as prompt compression or caching to reduce operational expenditure.
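One way to combine the cost and success signals described above is cost per successful task. The following is a minimal sketch, not a platform API: the `TaskRun` record, its field names, and the helper functions are hypothetical, and rates are assumed to be quoted in dollars per 1,000 tokens.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One agent task attempt (hypothetical record shape)."""
    input_tokens: int
    output_tokens: int
    succeeded: bool


def run_cost(run, input_rate, output_rate):
    """Cost of one run, with rates in dollars per 1,000 tokens."""
    return (run.input_tokens / 1000) * input_rate + (run.output_tokens / 1000) * output_rate


def cost_per_successful_task(runs, input_rate, output_rate):
    """Total spend (including failed runs) divided by the number of successes.

    Returns None when no run succeeded, because the ratio is undefined.
    """
    total = sum(run_cost(r, input_rate, output_rate) for r in runs)
    successes = sum(1 for r in runs if r.succeeded)
    return total / successes if successes else None
```

Failed runs still consume tokens, so they are counted in the numerator; a cheap model with a low success rate can end up costing more per completed task than a more expensive one.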
Key Components of Token Cost
Several components determine what a deployment actually pays per thousand tokens. Understanding them is critical for forecasting, budgeting, and optimizing AI agent deployments.
Input vs. Output Pricing
Providers almost always charge separately for input tokens (prompt) and output tokens (completion). Output tokens are typically 2-10x more expensive because generation is autoregressive: the model runs one forward pass per output token, whereas the prompt is processed in a single parallel pass. For example, a query with a 1,000-token prompt and a 500-token response is billed as 1.0 times the per-1K input rate plus 0.5 times the per-1K output rate. This structure rewards prompt engineering that shortens the context and generation parameters that cap the maximum number of output tokens.
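The split billing above can be sketched in a few lines. The function names and the sample rates ($0.50 input, $1.50 output per 1K tokens) are illustrative assumptions, not any provider's published prices.

```python
def request_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Cost of one request, with rates in dollars per 1,000 tokens."""
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate


def worst_case_cost(input_tokens, max_output_tokens, input_rate, output_rate):
    """Upper bound on cost when generation is capped at max_output_tokens."""
    return request_cost(input_tokens, max_output_tokens, input_rate, output_rate)


# 1,000-token prompt, 500-token response, hypothetical rates.
example = request_cost(1000, 500, input_rate=0.50, output_rate=1.50)
```

Because the output cap bounds the more expensive side of the bill, `worst_case_cost` gives a simple per-request budget ceiling before the request is sent.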
Context Window Consumption
The entire submitted context (system instructions, few-shot examples, conversation history, and retrieved documents) counts as input tokens. Long contexts built by Retrieval-Augmented Generation (RAG) or complex agentic memory dramatically increase cost. On a model with a 128k-token context window, a request that fills the window is billed for 128x the input tokens of a 1k-token request, even if the final output is short. Efficient context management and semantic chunking are essential cost controls.
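Conversation history makes this effect compound over a session. The sketch below assumes the client resends the full, untruncated history on every turn with no caching; the function name and the token counts in the usage note are hypothetical.

```python
def cumulative_input_tokens(system_tokens, turns):
    """Total input tokens billed over a conversation.

    turns is a list of (user_tokens, assistant_tokens) pairs, one per turn.
    Assumes the whole history is resent as input on every turn.
    """
    total = 0
    history = system_tokens
    for user_tokens, assistant_tokens in turns:
        history += user_tokens       # the new user message joins the context
        total += history             # the whole context is billed as input
        history += assistant_tokens  # the reply becomes history for the next turn
    return total
```

With a 100-token system prompt and three turns of 50 user tokens and 150 assistant tokens each, the billed input is 150 + 350 + 550 = 1,050 tokens, although the user typed only 150. Total input grows roughly quadratically with the number of turns, which is why truncation and summarization of history matter.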
Model Tier & Capability
Cost generally rises with model size and capability. Pricing tiers are commonly structured as:
- Economy/High-Latency: Cheapest, for batch processing.
- Standard/General Purpose: Balanced cost and speed for most agent tasks.
- Premium/Low-Latency: Highest cost, optimized for real-time user-facing agents.
Larger, more capable models (e.g., GPT-4, Claude 3 Opus) command a significant premium over smaller, faster models (e.g., GPT-3.5-Turbo, Claude 3 Haiku). The choice directly impacts both Cost Per Thousand Tokens and end-to-end latency.
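To see how tier choice plays out for a fixed workload, the same traffic can be priced against each tier. This is a sketch only: the tier names mirror the list above, but every rate in `TIERS` is a made-up placeholder, not a real price.

```python
# Hypothetical (input_rate, output_rate) pairs in dollars per 1,000 tokens.
TIERS = {
    "economy": (0.0005, 0.0015),
    "standard": (0.003, 0.015),
    "premium": (0.03, 0.06),
}


def monthly_cost(requests, avg_input_tokens, avg_output_tokens, input_rate, output_rate):
    """Projected monthly spend for a steady workload."""
    per_request = (avg_input_tokens / 1000) * input_rate + (avg_output_tokens / 1000) * output_rate
    return requests * per_request


def compare_tiers(requests, avg_input_tokens, avg_output_tokens, tiers=TIERS):
    """Return {tier name: projected monthly cost} for the same workload."""
    return {
        name: monthly_cost(requests, avg_input_tokens, avg_output_tokens, r_in, r_out)
        for name, (r_in, r_out) in tiers.items()
    }
```

For 100,000 requests per month averaging 2,000 input and 500 output tokens, these placeholder rates give about $175, $1,350, and $9,000 respectively, which illustrates why routing simple tasks to a smaller model is a common cost control.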
Caching & Optimization Techniques
Advanced serving techniques can reduce effective token cost:
- Prompt Caching: Identical prompt prefixes across requests are computed once.
- Continuous Batching: Groups multiple requests to maximize GPU utilization, amortizing overhead.
- Speculative Decoding: Uses a small, fast model to draft tokens verified by a larger model, reducing the large model's workload.
While these are often managed by the provider, they explain pricing differences between optimized and standard endpoints. On-premise deployments use these to lower Total Cost of Ownership (TCO).
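The effect of prompt caching on the input bill can be estimated as follows. This is a simplified sketch under stated assumptions: every request after the first hits the cache, cached prefix tokens are billed at a discounted rate, and there is no cache-write surcharge or expiry. The `cached_discount` parameter and its default of 0.5 are hypothetical; actual discounts vary by provider.

```python
def input_cost_with_caching(prefix_tokens, suffix_tokens, n_requests,
                            input_rate, cached_discount=0.5):
    """Total input cost for n_requests that share an identical prompt prefix.

    input_rate is in dollars per 1,000 tokens. cached_discount is the assumed
    fraction taken off the rate for cached prefix tokens (0 disables caching).
    """
    if n_requests <= 0:
        return 0.0
    # The first request pays the full rate for prefix and suffix.
    first = (prefix_tokens + suffix_tokens) / 1000 * input_rate
    # Later requests pay a reduced rate for the prefix and full rate for the suffix.
    later = (prefix_tokens * (1 - cached_discount) + suffix_tokens) / 1000 * input_rate
    return first + (n_requests - 1) * later
```

The saving depends on how much of each prompt is a stable prefix, so placing system instructions and tool schemas ahead of per-request content tends to maximize it.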
Embedding & Vision Model Costs
Tokenization and pricing differ for non-text modalities:
- Text Embeddings: Priced per input token, used for vector database indexing and retrieval in RAG flows.
- Vision/Large Multimodal Models (LMMs): Input is often tokenized differently. High-resolution images can be represented as thousands of tokens (e.g., a 1024x1024 image ≈ 1,000+ tokens). This makes multi-modal agent interactions, like analyzing diagrams or screenshots, significantly more expensive than pure text.
Tool Calling & Function Execution
When an agent uses tool calling (e.g., OpenAI's function calling, Model Context Protocol), the structured function definitions in the prompt are counted as input tokens. The model's output containing the function call arguments is counted as output tokens. Complex agents with extensive toolkits incur a persistent overhead in every interaction, as the tool schemas must remain in context. This makes telemetry that attributes cost to specific tool calls essential.
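The persistent overhead of tool schemas can be estimated once their token counts are known. The sketch below assumes those counts have already been measured with the provider's tokenizer and that every schema is sent on every model call; the function name, tool names, and rate are hypothetical.

```python
def tool_schema_overhead(schema_tokens_by_tool, model_calls, input_rate):
    """Input tokens and cost spent only on carrying tool definitions.

    schema_tokens_by_tool maps a tool name to the token count of its definition.
    input_rate is in dollars per 1,000 tokens.
    Returns (total_tokens, total_cost).
    """
    tokens_per_call = sum(schema_tokens_by_tool.values())
    total_tokens = tokens_per_call * model_calls
    return total_tokens, total_tokens / 1000 * input_rate
```

For example, three tools with definitions of 120, 80, and 300 tokens add 500 input tokens to every call; over 1,000 calls at $0.03 per 1K input tokens that is 500,000 tokens, or about $15, before any user content is processed.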
CPT Pricing Models Across Major Providers
A comparison of Cost Per Thousand Tokens (CPT) pricing for input (prompt) and output (completion) tokens across leading cloud AI inference platforms, as of Q2 2024. Prices are for standard on-demand inference and exclude committed-use discounts and fine-tuned model variants.
| Metric | OpenAI (GPT-4) | Anthropic (Claude 3 Opus) | Google (Gemini 1.5 Pro) | Meta (Llama 3 70B via Groq) | Mistral AI (Mistral Large) |
|---|---|---|---|---|---|
| Input Token Price (per 1K) | $0.03 | $0.015 | $0.000125 | $0.00059 | $0.002 |
| Output Token Price (per 1K) | $0.06 | $0.075 | $0.000375 | $0.00079 | $0.006 |
| Output/Input Price Ratio | 2.0x | 5.0x | 3.0x | 1.34x | 3.0x |
| Context Window (Tokens) | 128K | 200K | 1M | 8K (typical) | 32K |
| Typical P50 Latency | 300-500ms | 2-4s | 1-3s | < 100ms | 200-400ms |
| Typical P99 Latency | 1-2s | 8-12s | 5-8s | < 200ms | 1-2s |
| Minimum Charge per Request | | | | | |
| Batch Inference Discount | | | | | |
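The ratio row in the table follows directly from the two price rows, and a single blended rate can be derived the same way for a given traffic mix. This is a small sketch; `output_share` (the fraction of all tokens that are output) is a workload assumption that has to be measured, and the function names are illustrative.

```python
def output_input_ratio(input_rate, output_rate):
    """How many times more an output token costs than an input token."""
    return output_rate / input_rate


def blended_rate(input_rate, output_rate, output_share):
    """Average price per 1,000 tokens for a given mix of input and output."""
    if not 0 <= output_share <= 1:
        raise ValueError("output_share must be between 0 and 1")
    return (1 - output_share) * input_rate + output_share * output_rate


# Rates taken from the GPT-4 and Llama 3 70B (Groq) columns of the table.
gpt4_ratio = round(output_input_ratio(0.03, 0.06), 2)
groq_ratio = round(output_input_ratio(0.00059, 0.00079), 2)
```

A blended rate is convenient for comparing providers with a single number, but it is only valid for the mix it was computed from: agent workloads dominated by long contexts skew heavily toward input tokens.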
Frequently Asked Questions
Cost Per Thousand Tokens (CPT) is the fundamental unit of pricing for generative AI and large language model APIs. These questions address how it's calculated, its impact on total cost of ownership, and strategies for optimization.
Cost Per Thousand Tokens (CPT) is a standardized pricing metric used by AI cloud providers to charge for language model inference, based on the volume of input (prompt) and output (completion) tokens processed. The cost of a request is calculated by dividing the input and output token counts by 1,000, multiplying each by the provider's published rate for that token type and model, and summing the two results. For example, if a model costs $0.50 per 1K input tokens and $1.50 per 1K output tokens, a request with a 500-token prompt and a 200-token response would cost (500/1000 * $0.50) + (200/1000 * $1.50) = $0.25 + $0.30 = $0.55.
Providers typically publish separate rates for input and output tokens, as generating text (output) is computationally more intensive than reading it (input). This granular billing makes Agent Cost Telemetry—tracking and attributing token usage to specific sessions, users, or features—a critical engineering practice for financial control.
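In symbols, the per-request calculation can be written as follows, where the symbol names are introduced here for illustration: n_in and n_out are the input and output token counts, and r_in and r_out are the corresponding rates per 1,000 tokens.

```latex
\[
C = \frac{n_{\text{in}}}{1000}\, r_{\text{in}} + \frac{n_{\text{out}}}{1000}\, r_{\text{out}}
\]
\[
C = \frac{500}{1000}\cdot 0.50 + \frac{200}{1000}\cdot 1.50 = 0.25 + 0.30 = 0.55 \ \text{(dollars)}
\]
```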
Related Terms
Cost Per Thousand Tokens is a core operational metric for AI systems. These related terms define the broader framework for measuring, managing, and optimizing agent performance and cost.
Agent Cost Telemetry
The specialized observability practice of tracking and attributing computational and financial costs to individual agent sessions, actions, or users. This involves instrumenting systems to capture granular data on token consumption, API call expenses, and compute runtime. Key outputs include:
- Per-session cost breakdowns for user-facing billing.
- Attribution of costs to specific tools or external service calls.
- Identification of high-cost agent behaviors or prompt patterns for optimization.
- Integration with FinOps dashboards for budget forecasting and showback.
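The per-session and per-tool breakdowns listed above amount to grouping usage events by different keys. The sketch below assumes a hypothetical event shape (a dict with `session_id`, `tool`, `model`, `input_tokens`, and `output_tokens`) and a hypothetical rate table; real telemetry schemas will differ.

```python
from collections import defaultdict


def attribute_costs(events, rates):
    """Aggregate cost by session and by tool.

    events: iterable of dicts with keys session_id, tool, model,
            input_tokens, output_tokens (hypothetical schema).
    rates:  dict mapping model name to (input_rate, output_rate),
            both in dollars per 1,000 tokens.
    Returns (cost_by_session, cost_by_tool) as plain dicts.
    """
    by_session = defaultdict(float)
    by_tool = defaultdict(float)
    for event in events:
        input_rate, output_rate = rates[event["model"]]
        cost = (event["input_tokens"] / 1000) * input_rate \
            + (event["output_tokens"] / 1000) * output_rate
        by_session[event["session_id"]] += cost
        by_tool[event["tool"]] += cost
    return dict(by_session), dict(by_tool)
```

Pricing each event at the rate of the model that served it keeps the attribution correct when an agent routes different steps to different models.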
Total Cost of Ownership (TCO)
A comprehensive financial assessment of deploying and operating an AI agent system over its entire lifecycle. For agentic systems, TCO extends beyond cloud API fees (Cost Per Thousand Tokens) to include:
- Infrastructure costs for hosting, networking, and data storage.
- Development and integration costs for building orchestration logic and tool integrations.
- Maintenance costs for monitoring, model updates, and prompt engineering.
- Indirect costs related to data governance, security, and compliance efforts.
This holistic view is essential for CTOs to justify investments and compare build-vs-buy strategies.
Tokens Per Second (TPS)
A core throughput metric that quantifies the raw inference speed of a language model or AI agent, measured in output tokens generated per second. TPS is inversely related to latency and directly impacts Cost Per Thousand Tokens under time-based cloud pricing models. Key considerations include:
- Hardware dependency: TPS varies significantly based on GPU/accelerator type and batch size.
- Model architecture impact: Smaller, optimized models (e.g., Small Language Models) typically achieve higher TPS.
- Trade-off with quality: Techniques to boost TPS, like aggressive quantization, can affect output accuracy.
Engineering leaders use TPS for capacity planning and evaluating the cost-efficiency of different model serving options.
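Under time-based pricing, throughput converts directly into an effective price per thousand tokens. The sketch below assumes a single instance billed by the hour and an aggregate TPS figure across all concurrent requests; the `utilization` parameter (the fraction of billed time spent generating tokens) and the sample numbers are hypothetical.

```python
def cost_per_1k_tokens(hourly_price, tokens_per_second, utilization=1.0):
    """Effective dollars per 1,000 generated tokens on hourly-billed hardware."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens actually produced during one billed hour.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_price / tokens_per_hour * 1000
```

At $3.60 per hour and 100 tokens per second, a fully busy instance works out to $0.01 per 1K tokens; at 50% utilization the effective price doubles to $0.02, which is why idle capacity matters as much as raw speed.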
Resource Utilization
A metric measuring the percentage of available system hardware resources—such as GPU, CPU, memory, and vRAM—consumed by an AI workload. High utilization indicates efficient use of expensive infrastructure, directly lowering the effective Total Cost of Ownership. Monitoring involves:
- Tracking GPU utilization during inference and training batches.
- Identifying memory bottlenecks that cause swapping and slow down token generation.
- Using metrics to right-size cloud instances or Kubernetes pod requests/limits.
Poor utilization often points to performance bottlenecks in the serving stack or inefficient batching strategies.
Inference Optimization
The engineering discipline focused on reducing the computational cost and latency of executing trained models. Techniques directly lower the effective Cost Per Thousand Tokens by making token generation cheaper and faster. Core methods include:
- Continuous batching: Dynamically grouping incoming requests to maximize GPU throughput.
- KV Cache optimization: Efficiently managing the attention key-value cache to reduce memory bandwidth.
- Model compression: Applying post-training quantization and pruning to shrink model size.
- Kernel fusion: Using custom, low-level GPU kernels to minimize operational overhead.
This pillar is critical for CTOs mandating infrastructure cost control.
Performance Baseline
An established set of metric values that defines the expected normal operating performance of an AI system, used as a reference for detecting regressions or improvements. For cost-aware systems, a baseline includes:
- Expected Cost Per Thousand Tokens for standard query types.
- Target Tokens Per Second (TPS) and latency distributions.
- Normal ranges for Resource Utilization.
- Standard Task Success Rate and Accuracy.
Deviations from this baseline in the unfavorable direction, known as performance regressions, trigger investigations into whether cost increases are justified by quality improvements or are due to inefficiencies.
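A baseline check of this kind can be sketched as a comparison with a relative tolerance. The metric names, the 10% default tolerance, and the set of metrics treated as higher-is-better are all assumptions chosen for illustration.

```python
HIGHER_IS_BETTER = frozenset({"tokens_per_second", "task_success_rate"})


def find_regressions(baseline, current, tolerance=0.10,
                     higher_is_better=HIGHER_IS_BETTER):
    """Return {metric: relative_change} for metrics that got worse than allowed.

    baseline and current map metric names to values. A metric regresses when
    it moves in its unfavorable direction by more than `tolerance` (relative
    to the baseline). Metrics missing from `current` or with a zero baseline
    are skipped.
    """
    regressions = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue
        change = (current[name] - base) / base
        # For higher-is-better metrics a drop is bad; otherwise a rise is bad.
        worsening = -change if name in higher_is_better else change
        if worsening > tolerance:
            regressions[name] = change
    return regressions
```

Reporting cost and quality metrics side by side makes the question in the paragraph above answerable: a cost increase that appears without a matching improvement in success rate is a candidate inefficiency.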

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.