In a production LLM stack, callback handlers act as the telemetry gateway. They are hooks you attach to your LangChain Chains, Agents, and Tools to stream granular execution data—token counts, intermediate steps, tool call inputs/outputs, retrieval scores, and latency—to your chosen observability backend. This is not just logging; it's the foundation for cost attribution, performance debugging, and governance. Without this layer, your AI operations team is flying blind, unable to trace a problematic customer response back to a specific prompt, model, or retrieved document chunk.
AI Integration for LangChain Callback Handlers

Where Callback Handlers Fit in Your LLM Observability Stack
LangChain callback handlers are the instrumentation layer that connects your agentic workflows to enterprise-grade monitoring platforms like Arize AI and Weights & Biases.
A robust implementation wires these handlers to capture data at key lifecycle events: on_llm_start, on_llm_end, on_tool_start, on_chain_end. This data is then batched and sent via the monitoring platform's SDK (e.g., Arize's phoenix client or W&B's wandb logger) to a centralized service. The critical architectural decision is what to capture and at what granularity. For a support agent, you might log the final answer and the top-3 retrieved knowledge base snippets. For a financial analysis chain, you must log the sequence of tool calls to an internal API for a complete audit trail. This data model directly informs your dashboards for SLOs like latency p95, cost per session, and retrieval precision.
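The lifecycle hooks named above can be sketched as a small handler that records one event per hook. This is a minimal sketch rather than a definitive implementation: `EventRecorder` and its `events` list are hypothetical names, and the fallback base class exists only so the snippet runs without LangChain installed. The method names follow LangChain's `BaseCallbackHandler`.

```python
import time

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # fallback so the sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass


class EventRecorder(BaseCallbackHandler):
    """Hypothetical handler that records one dict per lifecycle event."""

    def __init__(self):
        self.events = []

    def _record(self, event, run_id, **fields):
        self.events.append(
            {"event": event, "run_id": str(run_id), "ts": time.time(), **fields}
        )

    def on_llm_start(self, serialized, prompts, *, run_id=None, **kwargs):
        self._record("llm_start", run_id, prompt_count=len(prompts))

    def on_llm_end(self, response, *, run_id=None, **kwargs):
        # llm_output is provider-specific and may be None.
        usage = (getattr(response, "llm_output", None) or {}).get("token_usage", {})
        self._record("llm_end", run_id, total_tokens=usage.get("total_tokens"))

    def on_tool_start(self, serialized, input_str, *, run_id=None, **kwargs):
        self._record("tool_start", run_id, tool=(serialized or {}).get("name"))

    def on_chain_end(self, outputs, *, run_id=None, **kwargs):
        keys = sorted(outputs) if isinstance(outputs, dict) else None
        self._record("chain_end", run_id, output_keys=keys)
```

A handler like this is typically attached per call, for example `chain.invoke(inputs, config={"callbacks": [EventRecorder()]})`, and the recorded events are then forwarded to the monitoring SDK of your choice.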
Rollout requires a phased approach. Start by instrumenting a single, high-value chain in a non-critical environment. Use the callback data to establish a performance baseline. Then, gradually expand coverage, ensuring your handlers are non-blocking (using async or background threads) to avoid adding latency to user-facing requests. Governance is enforced here: callback handlers can be configured to redact PII before data leaves your environment and to sample high-volume, low-risk interactions to manage observability costs. This layer becomes the single source of truth for why an AI agent made a decision, enabling root cause analysis when metrics in Arize or W&B indicate drift or degradation.
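Redaction before data leaves your environment could look roughly like the following. It is a sketch under stated assumptions: the two regular expressions are illustrative only and will miss many PII formats, and `redact` and `redact_payload` are hypothetical helper names that a handler would call before handing an event to any SDK.

```python
import re

# Illustrative patterns only; production systems usually need a broader, tested rule set.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = _EMAIL.sub("[EMAIL]", text)
    return _PHONE.sub("[PHONE]", text)


def redact_payload(payload: dict) -> dict:
    """Return a copy of a flat event payload with every string value redacted."""
    return {k: redact(v) if isinstance(v, str) else v for k, v in payload.items()}
```

Running redaction inside the handler, rather than in the monitoring platform, keeps raw prompts and completions from ever crossing the network boundary.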
Ultimately, treating callback handlers as first-class, versioned components of your AI application—not an afterthought—is what separates a prototype from a governed production system. They enable the closed-loop feedback required to iterate on prompts, fine-tune retrieval strategies, and prove compliance. For teams scaling beyond a few chains, we recommend implementing a unified callback manager that routes data to multiple destinations (e.g., Arize for monitoring, W&B for experiment comparison, and an internal data lake for custom analytics) based on environment and tags.
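One way to picture the unified callback manager is a small router that decides, per event, which destinations receive it. The sketch below assumes each destination is wrapped in a plain callable (a "sink"); `TelemetryRouter`, `Route`, and the environment and tag names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

Event = Dict[str, object]
Sink = Callable[[Event], None]


@dataclass
class Route:
    sink: Sink                                             # e.g. a wrapper around an SDK call
    environments: Set[str]                                 # environments this sink accepts
    required_tags: Set[str] = field(default_factory=set)   # all must be present on the event


class TelemetryRouter:
    def __init__(self, environment: str):
        self.environment = environment
        self.routes: List[Route] = []

    def add_route(self, route: Route) -> None:
        self.routes.append(route)

    def dispatch(self, event: Event, tags: Set[str]) -> int:
        """Send the event to every matching sink; return how many accepted it."""
        delivered = 0
        for route in self.routes:
            if self.environment in route.environments and route.required_tags <= tags:
                try:
                    route.sink(event)
                    delivered += 1
                except Exception:
                    # A failing destination must not break the others or the request.
                    pass
        return delivered
```

Keeping routing rules in one place means a new destination or environment is a configuration change, not an edit to every chain.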
Callback Handler Integration Points Across the LLM Lifecycle
Instrumenting LangChain for Experiment Tracking
During development, custom callback handlers capture granular telemetry from LangChain runs—prompts, completions, token usage, and intermediate chain steps—and stream them to platforms like Weights & Biases or Arize AI. This creates a complete experiment lineage, linking model outputs to specific code commits, prompt versions, and hyperparameters.
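```python
# Requires langchain-community, langchain-openai, and wandb; set WANDB_PROJECT first.
# Import paths differ across LangChain versions; adjust to match your install.
from langchain_community.callbacks.manager import wandb_tracing_enabled
from langchain_openai import ChatOpenAI

llm = ChatOpenAI()

# Runs executed inside this block are traced to W&B.
with wandb_tracing_enabled():
    result = llm.invoke("Explain RAG.")  # Traces the prompt, completion, and run timing
```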
Integrating callbacks here enables data scientists to compare model performance, attribute costs to experiments, and ensure reproducible research before promoting a chain to staging.
High-Value Use Cases for Custom Callback Handlers
Custom LangChain callback handlers are the critical integration point for streaming LLM telemetry—token usage, intermediate steps, tool calls—to your monitoring and governance platforms. These patterns show where to instrument your agents and chains for production-grade observability.
Real-Time Cost Attribution for Multi-Model Agents
Instrument agents that dynamically route between OpenAI GPT-4, Anthropic Claude, and open-source models. A custom handler streams token counts, model names, and timestamps to a data warehouse or cost platform like CloudHealth, enabling per-project, per-team, and per-workload spend tracking.
Workflow: Handler captures on_llm_end events, enriches with metadata (project ID, user), and pushes to a message queue for aggregation.
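The enrichment step in this workflow might be sketched as follows. The prices and model names are placeholders, not real rates, and `build_cost_record`, `emit_cost_record`, and the in-process `queue.Queue` are hypothetical stand-ins for whatever message queue you actually use.

```python
import queue
import time

# Placeholder prices in USD per 1,000 tokens; substitute your providers' current rates.
PRICES_PER_1K = {
    "example-large": {"input": 0.01, "output": 0.03},
    "example-small": {"input": 0.001, "output": 0.002},
}

cost_events: "queue.Queue[dict]" = queue.Queue()


def build_cost_record(model, input_tokens, output_tokens, project_id, user):
    """Combine token counts from on_llm_end with business metadata."""
    price = PRICES_PER_1K.get(model)
    cost = None  # unknown model: leave the cost unset instead of guessing
    if price is not None:
        cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1000
    return {
        "ts": time.time(),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": cost,
        "project_id": project_id,
        "user": user,
    }


def emit_cost_record(record):
    """Hand the record to the aggregation queue without blocking the request."""
    cost_events.put_nowait(record)
```

Leaving `cost_usd` empty for unknown models makes gaps in the price table visible downstream instead of silently reporting zero spend.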
Tool Call Auditing for Secure Agent Workflows
Agents calling internal APIs (SQL databases, CRM updates) require an immutable audit trail. A handler logs every on_tool_start and on_tool_end event—including sanitized inputs and outputs—to a security log platform like Splunk or a governed database.
Operational Value: Provides compliance evidence, enables replay for debugging, and detects anomalous tool usage patterns.
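A tamper-evident log is one way to approximate the immutable trail described here. The sketch below chains each entry to the previous one with SHA-256; this hash-chain technique is an assumption of the example, not something a particular log platform requires, and `AuditLog`, `sanitize`, and `SENSITIVE_KEYS` are hypothetical names.

```python
import hashlib
import json

SENSITIVE_KEYS = {"password", "api_key", "token"}  # hypothetical; extend for your tools


def sanitize(args: dict) -> dict:
    """Mask values of sensitive keys before they are written to the log."""
    return {k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else v for k, v in args.items()}


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True, default=str).encode()).hexdigest()


class AuditLog:
    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64

    def append(self, event: str, tool: str, payload: dict) -> None:
        """Record a tool event, linking it to the hash of the previous entry."""
        body = {
            "event": event,
            "tool": tool,
            "payload": sanitize(payload),
            "prev_hash": self._last_hash,
        }
        self._last_hash = _digest(body)
        self.entries.append({**body, "hash": self._last_hash})

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry makes this return False."""
        prev = "0" * 64
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True
```

In a handler, `append` would be called from `on_tool_start` and `on_tool_end`, with the entries shipped to the security log platform in order.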
End-to-End Latency Tracing for RAG Pipelines
Break down total response time for RAG queries. Handlers instrument each stage: on_retriever_start/end (vector search), on_llm_start/end (generation), and on_chain_end (final answer). Stream spans to tracing backends like LangSmith, Datadog, or W&B for a service map.
Workflow: Correlate trace IDs across handlers to identify bottlenecks (e.g., slow retrieval vs. slow LLM calls).
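Per-stage timing keyed by run ID can be sketched like this. `StageTimer` is a hypothetical helper: a handler would call `start` from `on_retriever_start` or `on_llm_start` and `end` from the matching end hook, passing the `run_id` LangChain supplies.

```python
import time
from collections import defaultdict


class StageTimer:
    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._open = {}                      # run_id -> (stage, start time)
        self.totals_ms = defaultdict(float)  # stage -> accumulated milliseconds

    def start(self, run_id, stage):
        self._open[run_id] = (stage, self._clock())

    def end(self, run_id):
        """Close the span for run_id and return its duration in milliseconds."""
        stage, started = self._open.pop(run_id)
        elapsed_ms = (self._clock() - started) * 1000
        self.totals_ms[stage] += elapsed_ms
        return elapsed_ms

    def bottleneck(self):
        """Return the stage with the largest accumulated time."""
        return max(self.totals_ms, key=self.totals_ms.get)
```

Keying open spans by run ID rather than storing a single start time keeps concurrent retriever and LLM calls from overwriting each other.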
Streaming Intermediate Steps to Arize AI for RCA
For complex chains, stream each intermediate step and its result to Arize AI as a custom step event. This enables root cause analysis when a final answer is poor—you can drill down to see which specific retrieval or reasoning step failed.
Operational Value: Cuts debugging time from hours to minutes by isolating failure points in multi-step agentic workflows.
Prompt & Response Sampling for Human Review
Implement a sampling handler that captures a percentage of LLM inputs/outputs based on rules (e.g., low confidence scores, specific topics). Stream these pairs to a review queue in platforms like Labelbox or a custom dashboard for human evaluation and fine-tuning dataset creation.
Workflow: Integrates with LangChain's built-in BaseCallbackHandler to add metadata (chain ID, confidence) before sending to a secure storage bucket.
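The sampling rules could be expressed as a single predicate that the handler evaluates in `on_llm_end`. This is a sketch: the topic list, the 0.6 confidence floor, and the 5% base rate are placeholder values, and `should_review` is a hypothetical function name.

```python
import random

REVIEW_TOPICS = {"refund", "legal"}  # hypothetical topics that always go to review


def should_review(text, confidence=None, *, base_rate=0.05, min_confidence=0.6,
                  rng=random.random):
    """Decide whether a prompt/response pair is sent to the human review queue."""
    # Rule 1: low-confidence outputs are always reviewed.
    if confidence is not None and confidence < min_confidence:
        return True
    # Rule 2: sensitive topics are always reviewed.
    lowered = text.lower()
    if any(topic in lowered for topic in REVIEW_TOPICS):
        return True
    # Otherwise fall back to a random sample of ordinary traffic.
    return rng() < base_rate
```

Passing `rng` as a parameter makes the random branch easy to pin down in unit tests.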
Embedding Drift Detection for Vector Stores
As source documents evolve, embedding drift can degrade RAG performance. A handler attached to document ingestion pipelines logs statistics (embedding dimensions, mean values) of newly indexed chunks to Arize AI or a custom monitor.
Operational Value: Triggers alerts and re-indexing workflows when embeddings shift beyond a threshold, maintaining retrieval accuracy.
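A simple drift signal, assuming you keep a baseline sample of embeddings, is the cosine distance between the baseline centroid and the centroid of newly indexed chunks. The sketch below uses only the standard library; the 0.1 threshold is a placeholder to tune on your own data, and centroid distance is just one of several possible drift statistics.

```python
import math
from statistics import fmean


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dims = len(vectors[0])
    return [fmean(v[i] for v in vectors) for i in range(dims)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def drift_exceeded(baseline, new_batch, threshold=0.1):
    """True when the new batch's centroid has moved past the threshold."""
    return cosine_distance(centroid(baseline), centroid(new_batch)) > threshold
```

The same numbers (dimension count, per-batch centroid, distance) are what the handler would log to the monitoring platform for trending.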
Example Workflows: From Callback Event to Actionable Insight
LangChain callback handlers provide hooks into the execution lifecycle of chains, agents, and tools. By streaming this telemetry to platforms like Weights & Biases (W&B) and Arize AI, you can build a production-grade observability layer. Below are concrete workflows for instrumenting, monitoring, and governing LLM applications.
Trigger: A LangChain agent executes a sequence of tool calls (e.g., SQL query, API lookup).
Context Pulled: The custom callback handler captures:
- run_id, parent_run_id for trace lineage.
- Tool name, input arguments (sanitized), execution duration.
- LLM provider (OpenAI, Anthropic), model name, prompt/completion token counts.
Agent Action: The handler uses the W&B SDK (wandb.log) to stream metrics in real-time to a dedicated W&B run. It structures logs to separate:
- Cost Metrics: llm/total_tokens, llm/cost_estimate.
- Performance Metrics: agent/tool_call_latency_ms, agent/steps_per_session.
- Custom Dimensions: tool_name, model_name, user_id (hashed).
System Update: W&B dashboards visualize cost per session, most-used tools, and token efficiency trends. Alerts can be configured via W&B alerts for anomalous token spikes.
Human Review Point: If cost per session exceeds a defined threshold, the run is tagged and an incident is created in a linked system like Jira for an architect to review agent logic.
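The metric structure in this workflow might be assembled as below. The metric keys come from the workflow above; `build_metrics`, `log_session`, and the 1.0 USD threshold are hypothetical, and `log_fn` stands in for `wandb.log` so the sketch runs without a W&B account.

```python
import hashlib


def build_metrics(*, total_tokens, cost_estimate, tool_call_latency_ms, steps,
                  tool_name, model_name, user_id):
    """Group one session's telemetry under cost, performance, and dimension keys."""
    return {
        "llm/total_tokens": total_tokens,
        "llm/cost_estimate": cost_estimate,
        "agent/tool_call_latency_ms": tool_call_latency_ms,
        "agent/steps_per_session": steps,
        "tool_name": tool_name,
        "model_name": model_name,
        # Hash the user ID so the raw identifier never reaches the dashboard.
        "user_id": hashlib.sha256(user_id.encode()).hexdigest()[:16],
    }


def log_session(metrics, log_fn, cost_threshold=1.0):
    """Log the metrics and flag the session when cost exceeds the threshold."""
    needs_review = metrics["llm/cost_estimate"] > cost_threshold
    log_fn({**metrics, "needs_review": needs_review})
    return needs_review
```

With W&B, `log_fn` would simply be `wandb.log` inside an active run; the `needs_review` flag is what a downstream job could use to open a ticket.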
Implementation Architecture: Building Reliable Telemetry Pipelines
A practical guide to instrumenting LangChain applications for observability, cost tracking, and performance governance.
LangChain's callback system provides hooks into the execution lifecycle of chains, agents, and tools. To build a reliable telemetry pipeline, you instrument these handlers to stream key events—like token usage, tool execution, intermediate steps, and final outputs—to a central monitoring platform such as Weights & Biases or Arize AI. This involves mapping LangChain's internal execution graph (the sequence of LLM calls, retrievers, and tools) to a trace structure that platforms like LangSmith or W&B can ingest, providing a unified view of cost, latency, and success rates across complex, multi-step workflows.
A production implementation typically involves a custom BaseCallbackHandler that batches and asynchronously dispatches events to avoid adding latency to the user-facing request. Key data points to capture include: model_identifier, input_tokens, output_tokens, step_duration, tool_name, retrieved_document_ids, and any structured output or parsing errors. This data is then enriched with business context (e.g., user_id, session_id, workflow_type) before being sent via the monitoring platform's SDK or REST API. For high-volume applications, you may introduce a lightweight queue (e.g., Redis or an in-memory buffer) to decouple the telemetry emission from the primary request path.
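The batching and asynchronous dispatch described above can be sketched with a bounded in-memory queue and a background thread. `BatchingDispatcher` and `send_batch` are hypothetical names; in practice `send_batch` would wrap the monitoring platform's SDK or REST call, and the batch size, flush interval, and queue limit are placeholder defaults.

```python
import queue
import threading


class BatchingDispatcher:
    def __init__(self, send_batch, batch_size=50, flush_interval=2.0, max_queue=10_000):
        self._send_batch = send_batch
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._queue = queue.Queue(maxsize=max_queue)
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def emit(self, event) -> bool:
        """Called on the request path; never blocks. Returns False if the event was dropped."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False

    def _run(self):
        batch = []
        while True:
            try:
                batch.append(self._queue.get(timeout=self._flush_interval))
                got_event = True
            except queue.Empty:
                got_event = False
            stopping = self._stop.is_set() and self._queue.empty()
            # Flush when the batch is full, the queue went quiet, or we are shutting down.
            if batch and (len(batch) >= self._batch_size or not got_event or stopping):
                self._flush(batch)
                batch = []
            if stopping:
                return

    def _flush(self, batch):
        try:
            self._send_batch(batch)
        except Exception:
            pass  # telemetry failures must not crash the worker

    def close(self, timeout=5.0):
        """Drain remaining events and stop the worker thread."""
        self._stop.set()
        self._worker.join(timeout)
```

Dropping events when the queue is full is a deliberate trade-off in this sketch: losing some telemetry is usually preferable to slowing down a user-facing request.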
Governance and rollout require careful planning. Start by instrumenting non-critical internal workflows to validate data fidelity and establish baselines for token consumption and latency. Implement feature flags to control telemetry sampling rates, allowing you to manage cost and load on your monitoring infrastructure. Crucially, integrate this pipeline with your existing alerting systems (e.g., PagerDuty, Slack) to notify on anomalies like cost spikes per user, elevated failure rates in tool calling, or degradation in retrieval accuracy for your RAG pipelines. This architecture transforms LangChain from a prototyping framework into a governed, observable production system.
Code Examples: Custom Handler Patterns for Arize AI and W&B
Streaming LLM Telemetry to Arize AI
This handler captures token usage, latency, and model responses from LangChain runs, structuring them for Arize's log_prediction API. It's essential for monitoring cost per chain and detecting performance degradation in real-time.
```python
import time
import uuid

from arize.utils.types import ModelTypes
from langchain.callbacks.base import BaseCallbackHandler


class ArizeCallbackHandler(BaseCallbackHandler):
    def __init__(self, arize_client, model_id, model_version):
        self.arize_client = arize_client
        self.model_id = model_id
        self.model_version = model_version
        # Per-run state keyed by run_id so concurrent calls do not overwrite each other.
        self._runs = {}

    def on_llm_start(self, serialized, prompts, **kwargs):
        # on_llm_end does not receive the prompts, so store them with the start time.
        self._runs[kwargs.get("run_id")] = (time.time(), prompts[0])

    def on_llm_end(self, response, **kwargs):
        start_time, prompt = self._runs.pop(kwargs.get("run_id"), (time.time(), ""))
        latency_ms = (time.time() - start_time) * 1000
        generation = response.generations[0][0]
        # llm_output is provider-specific and may be None.
        token_usage = (response.llm_output or {}).get("token_usage", {})

        # Log the prediction to Arize. Method and argument names are kept from the
        # original example; verify them against your installed Arize SDK version.
        self.arize_client.log_prediction(
            model_id=self.model_id,
            model_version=self.model_version,
            model_type=ModelTypes.GENERATIVE_LLM,
            prediction_id=str(uuid.uuid4()),
            prediction_label=generation.text,
            features={"prompt": prompt},
            embedding_features={},
            prompt=prompt,
            response=generation.text,
            token_usage={
                "input_tokens": token_usage.get("prompt_tokens", 0),
                "output_tokens": token_usage.get("completion_tokens", 0),
            },
            latency=latency_ms,
        )
```
This pattern enables per-chain cost attribution and establishes a baseline for prompt performance, which is critical for optimizing expensive GPT-4 or Claude workloads.
Operational Impact: Before and After Custom Callback Integration
How custom LangChain callback handlers transform development, monitoring, and governance workflows by streaming fine-grained telemetry to platforms like Weights & Biases and Arize AI.
| Metric | Before callback integration | After callback integration | Notes |
|---|---|---|---|
| Experiment Tracking | Manual logging in spreadsheets or local files | Automated, versioned logging to W&B for every run | Enables reproducible research and team collaboration across all LLM experiments |
| Cost Attribution | Monthly API bill with no project/team breakdown | Real-time token usage per chain, agent, and project | Allows FinOps tracking and identifies expensive workflows for optimization |
| Debugging Agent Failures | Sifting through application logs to trace tool-calling errors | Visual trace of intermediate steps and tool inputs/outputs in Arize | Reduces MTTR for complex multi-agent issues from hours to minutes |
| Performance Monitoring | Periodic manual checks on latency and error rates | Dashboards with automated drift detection and SLO alerts | Proactive identification of degradation before user impact |
| Model Governance | Spreadsheet-based model registry and manual approval gates | W&B Model Registry with automated lineage from code to deployment | Enforces version control and provides audit trail for compliance |
| Prompt Management | Hard-coded prompts or environment variables, manual A/B testing | Versioned prompt templates with integrated A/B test results in Arize | Safe, measurable iteration on prompts without code deploys |
| Root Cause Analysis | Correlating system metrics with user complaints manually | Drill-down from poor output scores to specific problematic retrievals or tool calls | Accelerates optimization of RAG chunking strategies and tool logic |
Governance and Phased Rollout Considerations
Integrating LangChain callback handlers with monitoring platforms requires a deliberate approach to data governance, risk management, and controlled deployment.
A production-grade callback handler integration must be treated as a critical observability pipeline. This means implementing safeguards before streaming sensitive data like prompts, completions, and intermediate agent steps to external platforms like Arize AI or Weights & Biases. Key governance steps include:
- Data Sanitization & PII Scrubbing: Integrate a preprocessing layer within the callback to automatically redact or hash personally identifiable information (PII), API keys, and internal system identifiers before data leaves your environment.
- Access Control & RBAC: Ensure the callback's API credentials are scoped with the principle of least privilege, granting only the permissions necessary to write telemetry data (e.g., write:traces).
- Audit Trail: Log all callback initialization events, including the handler configuration and destination platform, to a secure, immutable log for security and compliance reviews.
A phased rollout minimizes risk and allows for performance validation. Start with a shadow deployment, where the callback handler is active but streams data to a dedicated, non-production project in your monitoring platform. This allows you to:
- Validate Data Fidelity: Confirm that token counts, step sequences, and custom metrics are captured accurately without impacting application latency.
- Assess Volume & Cost: Monitor the data volume generated to forecast monitoring platform costs and ensure your pipeline can handle peak loads.
- Refine Sampling Strategies: Implement sampling logic (e.g., log only 10% of traces, or all traces for high-risk workflows) to control costs while maintaining observability for critical paths.
Subsequent phases can enable the handler for specific high-value agent workflows or pilot teams before a full enterprise rollout.
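For the percentage-based sampling mentioned above, one option is to decide per trace rather than per event, so a sampled trace is always complete. The sketch below hashes the trace ID into a number in [0, 1); `sample_trace` is a hypothetical helper and the choice of SHA-256 is only for a stable, evenly spread value.

```python
import hashlib


def sample_trace(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of traces, deciding once per trace ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # evenly spread over [0, 1)
    return bucket < rate
```

Because the decision depends only on the trace ID, every handler and every process reaches the same answer for the same trace without sharing state.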
Finally, establish operational runbooks linked to the telemetry data. Define clear alert thresholds for anomalous metrics surfaced by the callback data—such as a spike in token usage per request or a drop in the success rate of tool calls. Integrate these alerts with your incident management platform (e.g., PagerDuty, ServiceNow) and designate an on-call rotation for AI operations. This closed-loop process ensures the callback integration drives actionable intelligence, not just passive monitoring, turning raw LangChain telemetry into a governed asset for reliable AI operations.
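Alert thresholds from a runbook can be kept as data and evaluated against each metrics snapshot. This is a minimal sketch with hypothetical names (`AlertRule`, `evaluate`) and placeholder thresholds; forwarding the resulting messages to an incident platform is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    direction: str  # "above" or "below"


def evaluate(rules, metrics):
    """Return one message per rule whose threshold is crossed by the current metrics."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric not reported in this window
        crossed = (
            (rule.direction == "above" and value > rule.threshold)
            or (rule.direction == "below" and value < rule.threshold)
        )
        if crossed:
            fired.append(f"{rule.metric}={value} is {rule.direction} {rule.threshold}")
    return fired
```

For example, a rule `AlertRule("tokens_per_request", 4000, "above")` paired with `AlertRule("tool_success_rate", 0.95, "below")` covers the two anomalies named in the paragraph above.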
FAQ: Technical and Commercial Questions
Common questions from engineering and MLOps teams implementing custom LangChain callback handlers to stream telemetry to platforms like Weights & Biases and Arize AI.
Custom callback handlers can capture a wide range of telemetry. The destination depends on your governance and operational needs:
- For Experiment Tracking & Collaboration: Stream to Weights & Biases (W&B).
  - on_llm_start: Log the model name, provider, and prompt.
  - on_llm_end: Capture the completion, token usage (input/output), latency, and cost.
  - on_chain_start/end: Log the chain type, inputs, outputs, and any intermediate steps for full traceability.
- For Production Monitoring & Drift Detection: Stream to Arize AI.
  - on_llm_end: Send prompt, completion, latency, and token counts as inference data.
  - on_tool_start/end: Log tool calls, execution time, and results to monitor external API reliability.
  - Attach ground truth or feedback scores via a separate feedback loop for performance calculation.
- For Audit Trails & Compliance: Stream to Credo AI or a secure data lake.
  - Capture all inputs/outputs, user IDs, and timestamps to satisfy regulatory inquiries.
Implementation Note: Use a single handler to fan out events to multiple systems, or create separate handlers for isolation. Always batch and async-write where possible to avoid blocking the main application thread.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.