Inferensys

AI Integration for LangChain Callback Handlers

Build custom LangChain callback handlers to stream token usage, intermediate steps, and cost data to monitoring platforms like Arize AI and Weights & Biases for fine-grained LLM observability.
ARCHITECTURE FOR PRODUCTION AI

Where Callback Handlers Fit in Your LLM Observability Stack

LangChain callback handlers are the instrumentation layer that connects your agentic workflows to enterprise-grade monitoring platforms like Arize AI and Weights & Biases.

In a production LLM stack, callback handlers act as the telemetry gateway. They are hooks you attach to your LangChain Chains, Agents, and Tools to stream granular execution data—token counts, intermediate steps, tool call inputs/outputs, retrieval scores, and latency—to your chosen observability backend. This is not just logging; it's the foundation for cost attribution, performance debugging, and governance. Without this layer, your AI operations team is flying blind, unable to trace a problematic customer response back to a specific prompt, model, or retrieved document chunk.

A robust implementation wires these handlers to capture data at key lifecycle events: on_llm_start, on_llm_end, on_tool_start, on_chain_end. This data is then batched and sent via the monitoring platform's SDK (e.g., Arize's phoenix client or W&B's wandb logger) to a centralized service. The critical architectural decision is what to capture and at what granularity. For a support agent, you might log the final answer and the top-3 retrieved knowledge base snippets. For a financial analysis chain, you must log the sequence of tool calls to an internal API for a complete audit trail. This data model directly informs your dashboards for SLOs like latency p95, cost per session, and retrieval precision.
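
As a minimal sketch of the lifecycle wiring described above, the handler below records one entry per finished LLM or tool run, keyed by run_id. The class name LifecycleRecorder and the record fields are illustrative assumptions, and the token_usage key in llm_output follows the OpenAI-style layout, which other providers may not populate. The import fallback exists only so the sketch runs without LangChain installed.

```python
import time
from types import SimpleNamespace
from uuid import uuid4

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # fallback so the sketch runs without LangChain installed
    BaseCallbackHandler = object


class LifecycleRecorder(BaseCallbackHandler):
    """Collects one record per finished LLM or tool run (hypothetical helper)."""

    def __init__(self):
        self._started = {}  # run_id -> (kind, name, start time)
        self.records = []   # finished runs, ready to batch and ship

    def _start(self, kind, serialized, run_id):
        name = (serialized or {}).get("name", "unknown")
        self._started[run_id] = (kind, name, time.perf_counter())

    def _end(self, run_id, **extra):
        fallback = ("unknown", "unknown", time.perf_counter())
        kind, name, started = self._started.pop(run_id, fallback)
        latency_ms = (time.perf_counter() - started) * 1000
        self.records.append({"run_id": str(run_id), "kind": kind, "name": name,
                             "latency_ms": latency_ms, **extra})

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        self._start("llm", serialized, run_id)

    def on_llm_end(self, response, *, run_id, **kwargs):
        # llm_output may be None for some providers
        usage = (response.llm_output or {}).get("token_usage", {})
        self._end(run_id, total_tokens=usage.get("total_tokens", 0))

    def on_tool_start(self, serialized, input_str, *, run_id, **kwargs):
        self._start("tool", serialized, run_id)

    def on_tool_end(self, output, *, run_id, **kwargs):
        self._end(run_id)


# Simulate one LLM run with a stand-in response object
recorder = LifecycleRecorder()
llm_run_id = uuid4()
recorder.on_llm_start({"name": "ChatOpenAI"}, ["Explain RAG."], run_id=llm_run_id)
fake_response = SimpleNamespace(llm_output={"token_usage": {"total_tokens": 42}})
recorder.on_llm_end(fake_response, run_id=llm_run_id)
```

In a real application the handler would be passed through the callbacks argument of a chain or model call, and recorder.records would be drained by whatever batching layer ships data to the monitoring backend.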

Rollout requires a phased approach. Start by instrumenting a single, high-value chain in a non-critical environment. Use the callback data to establish a performance baseline. Then, gradually expand coverage, ensuring your handlers are non-blocking (using async or background threads) to avoid adding latency to user-facing requests. Governance is enforced here: callback handlers can be configured to redact PII before data leaves your environment and to sample high-volume, low-risk interactions to manage observability costs. This layer becomes the single source of truth for why an AI agent made a decision, enabling root cause analysis when metrics in Arize or W&B indicate drift or degradation.

Ultimately, treating callback handlers as first-class, versioned components of your AI application—not an afterthought—is what separates a prototype from a governed production system. They enable the closed-loop feedback required to iterate on prompts, fine-tune retrieval strategies, and prove compliance. For teams scaling beyond a few chains, we recommend implementing a unified callback manager that routes data to multiple destinations (e.g., Arize for monitoring, W&B for experiment comparison, and an internal data lake for custom analytics) based on environment and tags.
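
One way to picture the unified callback manager is a small router that decides, per event, which destinations apply. The sketch below is an assumption about how such routing could look, not a LangChain API: Route, TelemetryRouter, and the destination names are hypothetical, and each send callable stands in for a real SDK call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Route:
    name: str
    send: Callable[[dict], None]  # stand-in for an SDK call
    environments: set = field(default_factory=lambda: {"dev", "staging", "prod"})
    required_tags: set = field(default_factory=set)


class TelemetryRouter:
    """Sends each event to every route matching the environment and tags."""

    def __init__(self, environment, routes):
        self.environment = environment
        self.routes = routes

    def dispatch(self, event, tags=()):
        delivered = []
        for route in self.routes:
            if self.environment not in route.environments:
                continue
            if not route.required_tags <= set(tags):
                continue
            try:
                route.send(event)
                delivered.append(route.name)
            except Exception:
                # One failing destination must not block the others
                continue
        return delivered


monitoring, experiments, data_lake = [], [], []
router = TelemetryRouter("prod", [
    Route("monitoring", monitoring.append, environments={"prod"}),
    Route("experiments", experiments.append, environments={"dev", "staging"}),
    Route("data_lake", data_lake.append, required_tags={"audit"}),
])
delivered = router.dispatch({"event": "llm_end", "total_tokens": 120}, tags=["audit"])
```

Here the production event reaches the monitoring and data lake routes but skips the experiment route, which is limited to dev and staging.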

LANGCHAIN INTEGRATION

Callback Handler Integration Points Across the LLM Lifecycle

Instrumenting LangChain for Experiment Tracking

During development, custom callback handlers capture granular telemetry from LangChain runs—prompts, completions, token usage, and intermediate chain steps—and stream them to platforms like Weights & Biases or Arize AI. This creates a complete experiment lineage, linking model outputs to specific code commits, prompt versions, and hyperparameters.

python
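# Assumes the langchain-community and wandb packages are installed and
# WANDB_API_KEY is set; check the import path against your LangChain version.
from langchain_community.callbacks import WandbCallbackHandler
from langchain_openai import ChatOpenAI

# WandbCallbackHandler is a regular handler, not a context manager.
# The project name is a placeholder.
wandb_handler = WandbCallbackHandler(project="langchain-experiments", job_type="inference")
llm = ChatOpenAI(callbacks=[wandb_handler])

result = llm.invoke("Explain RAG.")
# Push the data collected for this run to W&B and close the run
wandb_handler.flush_tracker(finish=True)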

Integrating callbacks here enables data scientists to compare model performance, attribute costs to experiments, and ensure reproducible research before promoting a chain to staging.

LANGCHAIN INTEGRATION PATTERNS

High-Value Use Cases for Custom Callback Handlers

Custom LangChain callback handlers are the critical integration point for streaming LLM telemetry—token usage, intermediate steps, tool calls—to your monitoring and governance platforms. These patterns show where to instrument your agents and chains for production-grade observability.

01

Real-Time Cost Attribution for Multi-Model Agents

Instrument agents that dynamically route between OpenAI GPT-4, Anthropic Claude, and open-source models. A custom handler streams token counts, model names, and timestamps to a data warehouse or cost platform like CloudHealth, enabling per-project, per-team, and per-workload spend tracking.

Workflow: Handler captures on_llm_end events, enriches with metadata (project ID, user), and pushes to a message queue for aggregation.

Batch -> Real-time
Spend visibility
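
The cost attribution workflow above could be sketched as follows. The price table is hypothetical, queue.Queue stands in for a real message queue, and the model_name and token_usage keys assume an OpenAI-style llm_output. The import fallback exists only so the sketch runs without LangChain installed.

```python
import queue
from datetime import datetime, timezone
from types import SimpleNamespace
from uuid import uuid4

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # fallback so the sketch runs without LangChain installed
    BaseCallbackHandler = object

# Hypothetical USD prices per 1,000 tokens; replace with your providers' rates
PRICE_PER_1K = {
    "model-a": {"input": 0.01, "output": 0.03},
    "model-b": {"input": 0.003, "output": 0.015},
}


def estimate_cost(model, prompt_tokens, completion_tokens):
    price = PRICE_PER_1K.get(model)
    if price is None:
        return None  # unknown model: report tokens without a cost
    return (prompt_tokens * price["input"] + completion_tokens * price["output"]) / 1000


class CostAttributionHandler(BaseCallbackHandler):
    """Enriches each finished LLM call with project metadata and queues it."""

    def __init__(self, out_queue, project_id, user_id):
        self.out_queue = out_queue
        self.project_id = project_id
        self.user_id = user_id

    def on_llm_end(self, response, *, run_id, **kwargs):
        output = response.llm_output or {}
        usage = output.get("token_usage", {})
        model = output.get("model_name", "unknown")
        prompt_tokens = usage.get("prompt_tokens", 0)
        completion_tokens = usage.get("completion_tokens", 0)
        self.out_queue.put({
            "run_id": str(run_id),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "project_id": self.project_id,
            "user_id": self.user_id,
            "model": model,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "cost_usd": estimate_cost(model, prompt_tokens, completion_tokens),
        })


events = queue.Queue()
handler = CostAttributionHandler(events, project_id="support-bot", user_id="u-123")
fake_response = SimpleNamespace(llm_output={
    "model_name": "model-a",
    "token_usage": {"prompt_tokens": 1000, "completion_tokens": 500},
})
handler.on_llm_end(fake_response, run_id=uuid4())
event = events.get_nowait()
```

A downstream consumer can then aggregate cost_usd by project_id or user_id in the warehouse.
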
02

Tool Call Auditing for Secure Agent Workflows

Agents calling internal APIs (SQL databases, CRM updates) require an immutable audit trail. A handler logs every on_tool_start and on_tool_end event—including sanitized inputs and outputs—to a security log platform like Splunk or a governed database.

Operational Value: Provides compliance evidence, enables replay for debugging, and detects anomalous tool usage patterns.

Immutable trail
For compliance
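
A hedged sketch of the audit handler: it writes one JSON line per tool event and masks values stored under sensitive keys. The deny-list, the record layout, and the sink callable are assumptions. Immutability is not provided by this code and has to come from the store behind the sink, for example an append-only log. The import fallback exists only so the sketch runs without LangChain installed.

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # fallback so the sketch runs without LangChain installed
    BaseCallbackHandler = object

SENSITIVE_KEYS = {"password", "api_key", "token", "ssn"}  # hypothetical deny-list


def sanitize(value):
    """Recursively mask values stored under sensitive keys."""
    if isinstance(value, dict):
        return {key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else sanitize(val)
                for key, val in value.items()}
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value


class ToolAuditHandler(BaseCallbackHandler):
    def __init__(self, sink):
        self.sink = sink  # any callable that accepts one JSON line

    def _write(self, event, run_id, payload):
        record = {"ts": datetime.now(timezone.utc).isoformat(),
                  "event": event, "run_id": str(run_id), **payload}
        self.sink(json.dumps(record, sort_keys=True, default=str))

    def on_tool_start(self, serialized, input_str, *, run_id, inputs=None, **kwargs):
        name = (serialized or {}).get("name", "unknown")
        raw = inputs if inputs is not None else input_str
        self._write("tool_start", run_id, {"tool": name, "inputs": sanitize(raw)})

    def on_tool_end(self, output, *, run_id, **kwargs):
        self._write("tool_end", run_id, {"output": sanitize(output)})


audit_lines = []
auditor = ToolAuditHandler(audit_lines.append)
call_id = uuid4()
auditor.on_tool_start({"name": "crm_update"}, "", run_id=call_id,
                      inputs={"customer_id": 42, "api_key": "secret"})
auditor.on_tool_end("updated 1 row", run_id=call_id)
```
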
03

End-to-End Latency Tracing for RAG Pipelines

Break down total response time for RAG queries. Handlers instrument each stage: on_retriever_start/end (vector search), on_llm_start/end (generation), and on_chain_end (final answer). Stream spans to tracing backends like LangSmith, Datadog, or W&B for a service map.

Workflow: Correlate trace IDs across handlers to identify bottlenecks (e.g., slow retrieval vs. slow LLM calls).

Pinpoint bottlenecks
In < 1 sprint
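
The stage-by-stage timing above can be sketched with a handler that opens a span on each start event and closes it on the matching end event, linking children to their parent through parent_run_id. SpanTimingHandler and slowest_child are hypothetical names, and the injectable clock is there only to make the example deterministic. The import fallback exists only so the sketch runs without LangChain installed.

```python
import time
from uuid import uuid4

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # fallback so the sketch runs without LangChain installed
    BaseCallbackHandler = object


class SpanTimingHandler(BaseCallbackHandler):
    """Times chain, retriever, and LLM runs and links them by parent_run_id."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._open = {}  # run_id -> (stage, parent_run_id, start)
        self.spans = []

    def _start(self, stage, run_id, parent_run_id):
        self._open[run_id] = (stage, parent_run_id, self._clock())

    def _end(self, run_id):
        opened = self._open.pop(run_id, None)
        if opened is None:
            return  # end event without a matching start
        stage, parent, start = opened
        self.spans.append({"stage": stage, "run_id": run_id, "parent_run_id": parent,
                           "duration_ms": (self._clock() - start) * 1000})

    def on_chain_start(self, serialized, inputs, *, run_id, parent_run_id=None, **kwargs):
        self._start("chain", run_id, parent_run_id)

    def on_chain_end(self, outputs, *, run_id, **kwargs):
        self._end(run_id)

    def on_retriever_start(self, serialized, query, *, run_id, parent_run_id=None, **kwargs):
        self._start("retriever", run_id, parent_run_id)

    def on_retriever_end(self, documents, *, run_id, **kwargs):
        self._end(run_id)

    def on_llm_start(self, serialized, prompts, *, run_id, parent_run_id=None, **kwargs):
        self._start("llm", run_id, parent_run_id)

    def on_llm_end(self, response, *, run_id, **kwargs):
        self._end(run_id)


def slowest_child(spans, parent_run_id):
    children = [span for span in spans if span["parent_run_id"] == parent_run_id]
    return max(children, key=lambda span: span["duration_ms"], default=None)


# Fake clock: retrieval takes 0.4 s, generation 0.1 s
ticks = iter([0.0, 0.0, 0.4, 0.4, 0.5, 0.5])
tracer = SpanTimingHandler(clock=lambda: next(ticks))
chain_id, retriever_id, llm_id = uuid4(), uuid4(), uuid4()
tracer.on_chain_start({}, {"question": "What is RAG?"}, run_id=chain_id)
tracer.on_retriever_start({}, "What is RAG?", run_id=retriever_id, parent_run_id=chain_id)
tracer.on_retriever_end([], run_id=retriever_id)
tracer.on_llm_start({}, ["prompt"], run_id=llm_id, parent_run_id=chain_id)
tracer.on_llm_end(None, run_id=llm_id)
tracer.on_chain_end({}, run_id=chain_id)
bottleneck = slowest_child(tracer.spans, chain_id)
```

In this simulated run the retriever span is the slowest child of the chain, which is the kind of signal a tracing backend would surface on a service map.
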
04

Streaming Intermediate Steps to Arize AI for RCA

For complex chains, stream each intermediate step and its result to Arize AI as a custom step event. This enables root cause analysis when a final answer is poor—you can drill down to see which specific retrieval or reasoning step failed.

Operational Value: Cuts debugging time from hours to minutes by isolating failure points in multi-step agentic workflows.

Hours -> Minutes
Debugging time
05

Prompt & Response Sampling for Human Review

Implement a sampling handler that captures a percentage of LLM inputs/outputs based on rules (e.g., low confidence scores, specific topics). Stream these pairs to a review queue in platforms like Labelbox or a custom dashboard for human evaluation and fine-tuning dataset creation.

Workflow: The handler subclasses LangChain's BaseCallbackHandler and adds metadata (chain ID, confidence) to each sampled pair before sending it to a secure storage bucket.

Targeted sampling
For fine-tuning
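
The rule-based selection could look like the sketch below. The thresholds, topic names, and record fields are assumptions for illustration, and the confidence score is expected to come from your own scoring step since LangChain does not supply one.

```python
def review_reason(record, confidence_floor=0.6,
                  review_topics=frozenset({"refunds", "legal"})):
    """Return why a record should be reviewed, or None to skip it."""
    if record.get("confidence", 1.0) < confidence_floor:
        return "low_confidence"
    if record.get("topic") in review_topics:
        return "sensitive_topic"
    return None


def build_review_item(record, chain_id):
    """Attach review metadata to a sampled prompt/response pair."""
    reason = review_reason(record)
    if reason is None:
        return None
    return {
        "chain_id": chain_id,
        "reason": reason,
        "confidence": record.get("confidence"),
        "prompt": record["prompt"],
        "response": record["response"],
    }


records = [
    {"prompt": "Where is my order?", "response": "It ships today.",
     "confidence": 0.92, "topic": "shipping"},
    {"prompt": "Can I get a refund?", "response": "Yes, within 30 days.",
     "confidence": 0.88, "topic": "refunds"},
    {"prompt": "Explain clause 4.", "response": "It covers liability.",
     "confidence": 0.41, "topic": "contracts"},
]
review_queue = [item for item in (build_review_item(r, "support-chain-v3") for r in records)
                if item is not None]
```
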
06

Embedding Drift Detection for Vector Stores

As source documents evolve, embedding drift can degrade RAG performance. A handler attached to document ingestion pipelines logs statistics (embedding dimensions, mean values) of newly indexed chunks to Arize AI or a custom monitor.

Operational Value: Triggers alerts and re-indexing workflows when embeddings shift beyond a threshold, maintaining retrieval accuracy.

Proactive alerts
Prevent decay
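
As one simple illustration of a drift statistic, the sketch below compares the mean vector of newly indexed chunks against a baseline using cosine distance. This is an assumed metric chosen for brevity, not the method any particular monitoring platform uses, and the 0.1 threshold is a placeholder to tune.

```python
import math


def mean_vector(embeddings):
    dims = len(embeddings[0])
    return [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dims)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def embedding_stats(embeddings):
    """Summary statistics to log for a batch of newly indexed chunks."""
    centroid = mean_vector(embeddings)
    return {"count": len(embeddings), "dimensions": len(centroid),
            "mean_value": sum(centroid) / len(centroid)}


def drift_check(baseline, new_batch, threshold=0.1):
    distance = cosine_distance(mean_vector(baseline), mean_vector(new_batch))
    return {"distance": distance, "alert": distance > threshold}


baseline = [[1.0, 0.0], [0.9, 0.1]]
stable = drift_check(baseline, [[0.95, 0.05]])
shifted = drift_check(baseline, [[0.0, 1.0]])
stats = embedding_stats(baseline)
```
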
IMPLEMENTATION PATTERNS

Example Workflows: From Callback Event to Actionable Insight

LangChain callback handlers provide hooks into the execution lifecycle of chains, agents, and tools. By streaming this telemetry to platforms like Weights & Biases (W&B) and Arize AI, you can build a production-grade observability layer. Below are concrete workflows for instrumenting, monitoring, and governing LLM applications.

Trigger: A LangChain agent executes a sequence of tool calls (e.g., SQL query, API lookup).

Context Pulled: The custom callback handler captures:

  • run_id, parent_run_id for trace lineage.
  • Tool name, input arguments (sanitized), execution duration.
  • LLM provider (OpenAI, Anthropic), model name, prompt/completion token counts.

Agent Action: The handler uses the W&B SDK (wandb.log) to stream metrics in real-time to a dedicated W&B run. It structures logs to separate:

  • Cost Metrics: llm/total_tokens, llm/cost_estimate.
  • Performance Metrics: agent/tool_call_latency_ms, agent/steps_per_session.
  • Custom Dimensions: tool_name, model_name, user_id (hashed).

System Update: W&B dashboards visualize cost per session, most-used tools, and token efficiency trends. Alerts for anomalous token spikes can be configured in W&B.

Human Review Point: If cost per session exceeds a defined threshold, the run is tagged and an incident is created in a linked system like Jira for an architect to review agent logic.
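
A minimal sketch of the metric payload and review check from this workflow. The metric keys come from the lists above, while the input record layout, the salt, and the 0.50 USD threshold are assumptions. The payload is built as a plain dict so it can be inspected without a W&B account.

```python
import hashlib


def hash_user_id(user_id, salt="replace-with-a-secret-salt"):
    """Hash the user ID so the raw identifier never leaves your environment."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]


def build_metrics(run):
    return {
        # Cost metrics
        "llm/total_tokens": run["prompt_tokens"] + run["completion_tokens"],
        "llm/cost_estimate": run["cost_usd"],
        # Performance metrics
        "agent/tool_call_latency_ms": run["tool_latency_ms"],
        "agent/steps_per_session": run["steps"],
        # Custom dimensions
        "tool_name": run["tool_name"],
        "model_name": run["model_name"],
        "user_id": hash_user_id(run["user_id"]),
    }


def needs_review(session_cost_usd, threshold_usd=0.50):
    """True when the session should be tagged for an architect to review."""
    return session_cost_usd > threshold_usd


metrics = build_metrics({
    "prompt_tokens": 1200, "completion_tokens": 300, "cost_usd": 0.62,
    "tool_latency_ms": 840, "steps": 5, "tool_name": "sql_query",
    "model_name": "model-a", "user_id": "u-123",
})
flag_for_review = needs_review(metrics["llm/cost_estimate"])
```

Inside an active W&B run, the resulting dict could be passed to wandb.log(metrics). Creating the Jira incident is left to whatever integration your team already uses.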

FROM LANGCHAIN CALLBACKS TO PRODUCTION MONITORING

Implementation Architecture: Building Reliable Telemetry Pipelines

A practical guide to instrumenting LangChain applications for observability, cost tracking, and performance governance.

LangChain's callback system provides hooks into the execution lifecycle of chains, agents, and tools. To build a reliable telemetry pipeline, you instrument these handlers to stream key events—like token usage, tool execution, intermediate steps, and final outputs—to a central monitoring platform such as Weights & Biases or Arize AI. This involves mapping LangChain's internal execution graph (the sequence of LLM calls, retrievers, and tools) to a trace structure that platforms like LangSmith or W&B can ingest, providing a unified view of cost, latency, and success rates across complex, multi-step workflows.

A production implementation typically involves a custom BaseCallbackHandler that batches and asynchronously dispatches events to avoid adding latency to the user-facing request. Key data points to capture include: model_identifier, input_tokens, output_tokens, step_duration, tool_name, retrieved_document_ids, and any structured output or parsing errors. This data is then enriched with business context (e.g., user_id, session_id, workflow_type) before being sent via the monitoring platform's SDK or REST API. For high-volume applications, you may introduce a lightweight queue (e.g., Redis or an in-memory buffer) to decouple the telemetry emission from the primary request path.
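
The batching and asynchronous dispatch described above could be sketched with a bounded in-memory queue and a background worker thread. BatchingDispatcher is a hypothetical helper, send_batch stands in for a monitoring SDK or REST call, and the sketch drops events when the queue is full rather than slowing the request path, which is a design choice you may not want for audit data.

```python
import queue
import threading


class BatchingDispatcher:
    """Buffers telemetry events and ships them in batches off the request path."""

    def __init__(self, send_batch, batch_size=20, flush_interval=1.0, max_queue=10_000):
        self._send_batch = send_batch
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._queue = queue.Queue(maxsize=max_queue)
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def emit(self, event):
        """Called from the callback handler; returns immediately."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False  # drop telemetry rather than block the request

    def _flush(self, batch):
        try:
            self._send_batch(batch)
        except Exception:
            pass  # a failing backend must not crash the worker

    def _run(self):
        batch = []
        while not (self._stop.is_set() and self._queue.empty()):
            try:
                batch.append(self._queue.get(timeout=self._flush_interval))
            except queue.Empty:
                pass
            # Flush when the batch is full or the queue has gone quiet
            if batch and (len(batch) >= self._batch_size or self._queue.empty()):
                self._flush(batch)
                batch = []
        if batch:
            self._flush(batch)

    def close(self):
        """Drain remaining events and stop the worker."""
        self._stop.set()
        self._worker.join()


sent_batches = []
dispatcher = BatchingDispatcher(sent_batches.append, batch_size=2, flush_interval=0.05)
for i in range(5):
    dispatcher.emit({"event": "llm_end", "seq": i})
dispatcher.close()
```

A handler would call dispatcher.emit(...) from on_llm_end or on_tool_end, and the application would call close() during shutdown so buffered events are not lost.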

Governance and rollout require careful planning. Start by instrumenting non-critical internal workflows to validate data fidelity and establish baselines for token consumption and latency. Implement feature flags to control telemetry sampling rates, allowing you to manage cost and load on your monitoring infrastructure. Crucially, integrate this pipeline with your existing alerting systems (e.g., PagerDuty, Slack) to notify on anomalies like cost spikes per user, elevated failure rates in tool calling, or degradation in retrieval accuracy for your RAG pipelines. This architecture transforms LangChain from a prototyping framework into a governed, observable production system.

LANGCHAIN INTEGRATION BLUEPRINTS

Code Examples: Custom Handler Patterns for Arize AI and W&B

Streaming LLM Telemetry to Arize AI

This handler captures token usage, latency, and model responses from LangChain runs and structures them for Arize's log_prediction API. It supports monitoring cost per chain and detecting performance degradation in real time.

python
import time
import uuid

from langchain_core.callbacks import BaseCallbackHandler
from arize.utils.types import ModelTypes


class ArizeCallbackHandler(BaseCallbackHandler):
    def __init__(self, arize_client, model_id, model_version):
        self.arize_client = arize_client
        self.model_id = model_id
        self.model_version = model_version
        # Keyed by run_id so concurrent LLM calls do not overwrite each other
        self._runs = {}

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        # Keep the prompt and start time; on_llm_end does not receive the prompts
        self._runs[run_id] = (time.perf_counter(), prompts[0] if prompts else "")

    def on_llm_end(self, response, *, run_id, **kwargs):
        start_time, prompt = self._runs.pop(run_id, (time.perf_counter(), ""))
        latency_ms = (time.perf_counter() - start_time) * 1000
        generation = response.generations[0][0]
        # llm_output can be None for some providers
        usage = (response.llm_output or {}).get("token_usage", {})

        # Log prediction to Arize.
        # NOTE: log_prediction and its keyword arguments are assumptions about
        # the Arize client; check them against the SDK version you have installed.
        self.arize_client.log_prediction(
            model_id=self.model_id,
            model_version=self.model_version,
            model_type=ModelTypes.GENERATIVE_LLM,
            prediction_id=str(uuid.uuid4()),
            prediction_label=generation.text,
            features={"prompt": prompt},
            embedding_features={},
            prompt=prompt,
            response=generation.text,
            token_usage={
                "input_tokens": usage.get("prompt_tokens", 0),
                "output_tokens": usage.get("completion_tokens", 0),
            },
            latency=latency_ms,
        )

This pattern enables per-chain cost attribution and establishes a baseline for prompt performance, which is critical for optimizing expensive GPT-4 or Claude workloads.

LANGCHAIN LLMOPS

Operational Impact: Before and After Custom Callback Integration

How custom LangChain callback handlers transform development, monitoring, and governance workflows by streaming fine-grained telemetry to platforms like Weights & Biases and Arize AI.

| Metric | Before AI | After AI | Notes |
| --- | --- | --- | --- |
| Experiment Tracking | Manual logging in spreadsheets or local files | Automated, versioned logging to W&B for every run | Enables reproducible research and team collaboration across all LLM experiments |
| Cost Attribution | Monthly API bill with no project/team breakdown | Real-time token usage per chain, agent, and project | Allows FinOps tracking and identifies expensive workflows for optimization |
| Debugging Agent Failures | Sifting through application logs to trace tool-calling errors | Visual trace of intermediate steps and tool inputs/outputs in Arize | Reduces MTTR for complex multi-agent issues from hours to minutes |
| Performance Monitoring | Periodic manual checks on latency and error rates | Dashboards with automated drift detection and SLO alerts | Proactive identification of degradation before user impact |
| Model Governance | Spreadsheet-based model registry and manual approval gates | W&B Model Registry with automated lineage from code to deployment | Enforces version control and provides audit trail for compliance |
| Prompt Management | Hard-coded prompts or environment variables, manual A/B testing | Versioned prompt templates with integrated A/B test results in Arize | Safe, measurable iteration on prompts without code deploys |
| Root Cause Analysis | Correlating system metrics with user complaints manually | Drill-down from poor output scores to specific problematic retrievals or tool calls | Accelerates optimization of RAG chunking strategies and tool logic |

FOR LANGCHAIN CALLBACK HANDLERS

Governance and Phased Rollout Considerations

Integrating LangChain callback handlers with monitoring platforms requires a deliberate approach to data governance, risk management, and controlled deployment.

A production-grade callback handler integration must be treated as a critical observability pipeline. This means implementing safeguards before streaming sensitive data like prompts, completions, and intermediate agent steps to external platforms like Arize AI or Weights & Biases. Key governance steps include:

  • Data Sanitization & PII Scrubbing: Integrate a preprocessing layer within the callback to automatically redact or hash personally identifiable information (PII), API keys, and internal system identifiers before data leaves your environment.
  • Access Control & RBAC: Ensure the callback's API credentials are scoped with the principle of least privilege, granting only the permissions necessary to write telemetry data (e.g., write:traces).
  • Audit Trail: Log all callback initialization events, including the handler configuration and destination platform, to a secure, immutable log for security and compliance reviews.
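
For the sanitization step in the list above, a regex-based scrubber is one simple starting point. The patterns below are illustrative assumptions (a generic email pattern, a US-style phone number, and an sk- prefixed key format) and will miss many real-world cases, so production systems usually pair this with a dedicated PII detection service.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
API_KEY = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")  # hypothetical key format


def redact(text):
    """Replace matches with placeholders before telemetry leaves the process."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    text = API_KEY.sub("[API_KEY]", text)
    return text


raw_prompt = "Contact jane.doe@example.com or 555-123-4567, key sk-abcdefghijklmnop1234"
clean_prompt = redact(raw_prompt)
```

A handler would apply redact to prompts, completions, and tool inputs inside on_llm_start, on_llm_end, and on_tool_start before handing them to any exporter.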

A phased rollout minimizes risk and allows for performance validation. Start with a shadow deployment, where the callback handler is active but streams data to a dedicated, non-production project in your monitoring platform. This allows you to:

  1. Validate Data Fidelity: Confirm that token counts, step sequences, and custom metrics are captured accurately without impacting application latency.
  2. Assess Volume & Cost: Monitor the data volume generated to forecast monitoring platform costs and ensure your pipeline can handle peak loads.
  3. Refine Sampling Strategies: Implement sampling logic (e.g., log only 10% of traces, or all traces for high-risk workflows) to control costs while maintaining observability for critical paths.
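
The sampling strategy in step 3 can be made deterministic by hashing the trace ID, so every handler that sees the same trace reaches the same keep-or-drop decision. The function name, the workflow labels, and the 10% default are assumptions for illustration.

```python
import hashlib


def keep_trace(trace_id, workflow, sample_rate=0.10,
               high_risk_workflows=frozenset({"payments", "account_changes"})):
    """Keep all high-risk traces and a stable fraction of everything else."""
    if workflow in high_risk_workflows:
        return True
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


kept = sum(keep_trace(f"trace-{i}", "faq") for i in range(10_000))
kept_fraction = kept / 10_000
```

Because the decision depends only on the trace ID, all spans of a sampled trace are retained together, which keeps traces complete in the monitoring platform.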

Subsequent phases can enable the handler for specific high-value agent workflows or pilot teams before a full enterprise rollout.

Finally, establish operational runbooks linked to the telemetry data. Define clear alert thresholds for anomalous metrics surfaced by the callback data—such as a spike in token usage per request or a drop in the success rate of tool calls. Integrate these alerts with your incident management platform (e.g., PagerDuty, ServiceNow) and designate an on-call rotation for AI operations. This closed-loop process ensures the callback integration drives actionable intelligence, not just passive monitoring, turning raw LangChain telemetry into a governed asset for reliable AI operations.
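
As a rough sketch of an alert threshold for token usage per request, the detector below flags values far above a rolling baseline. The window size, the three-sigma rule, and the minimum margin are assumptions to tune against your own traffic, and the returned flag would be wired to your incident platform by your own integration code.

```python
from collections import deque
from statistics import mean, pstdev


class SpikeDetector:
    """Flags values well above the rolling mean of recent observations."""

    def __init__(self, window=50, min_samples=10, sigmas=3.0, min_margin=100):
        self.values = deque(maxlen=window)
        self.min_samples = min_samples
        self.sigmas = sigmas
        self.min_margin = min_margin  # avoids alerts on tiny absolute changes

    def observe(self, value):
        is_spike = False
        if len(self.values) >= self.min_samples:
            baseline = mean(self.values)
            margin = max(self.sigmas * pstdev(self.values), self.min_margin)
            is_spike = value > baseline + margin
        self.values.append(value)
        return is_spike


detector = SpikeDetector()
# Normal traffic alternates between 1,000 and 1,100 tokens per request
baseline_flags = [detector.observe(1000 + 100 * (i % 2)) for i in range(20)]
spike_flag = detector.observe(5000)
```
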

LANGCHAIN CALLBACK HANDLERS

FAQ: Technical and Commercial Questions

Common questions from engineering and MLOps teams implementing custom LangChain callback handlers to stream telemetry to platforms like Weights & Biases and Arize AI.

Custom callback handlers can capture a wide range of telemetry. The destination depends on your governance and operational needs:

  • For Experiment Tracking & Collaboration: Stream to Weights & Biases (W&B).
    • on_llm_start: Log the model name, provider, and prompt.
    • on_llm_end: Capture the completion, token usage (input/output), latency, and cost.
    • on_chain_start/end: Log the chain type, inputs, outputs, and any intermediate steps for full traceability.
  • For Production Monitoring & Drift Detection: Stream to Arize AI.
    • on_llm_end: Send prompt, completion, latency, and token counts as inference data.
    • on_tool_start/end: Log tool calls, execution time, and results to monitor external API reliability.
    • Attach ground truth or feedback scores via a separate feedback loop for performance calculation.
  • For Audit Trails & Compliance: Stream to Credo AI or a secure data lake.
    • Capture all inputs/outputs, user IDs, and timestamps to satisfy regulatory inquiries.

Implementation Note: Use a single handler to fan out events to multiple systems, or create separate handlers for isolation. Always batch and async-write where possible to avoid blocking the main application thread.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.