Inferensys

AI Integration with Weights and Biases Custom Charts

Build custom visualization panels in W&B to monitor unique LLM metrics, such as token usage per conversation turn, tool call success rates, or sentiment trends in generated content.
FROM BLACK BOX TO BUSINESS DASHBOARD

Why Custom W&B Charts Are Critical for LLM Operations

Standard LLM metrics don't capture the operational and financial realities of running agents and RAG in production.

Weights & Biases charts whatever you log, and most projects start with generic ML metrics like loss and accuracy. Production LLM applications demand business-aware visualizations instead. Teams need to monitor token consumption per conversation turn to correlate cost spikes with specific user intents or agent tool calls. They need to track tool call success rates to identify flaky APIs that degrade user experience, and visualize sentiment trends in generated content to catch brand safety issues early. Without custom charts, you're flying blind on what matters: unit economics, workflow reliability, and output quality.

Building these panels requires instrumenting your LangChain or custom application to log structured events with wandb.log(). For example, each agent step should emit metrics such as tool_call_latency_ms and tool_success_bool, and a RAG retrieval event should log retrieved_chunk_relevance_score and context_token_count. These fields become dimensions you can slice in W&B's custom chart builder to create operational dashboards that answer specific questions: "Which customer segment is driving the highest cost per resolved ticket?" or "Is our new embedding model improving retrieval accuracy for technical support queries?"
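As a rough illustration of the agent-step instrumentation described above, the sketch below wraps a tool call and produces the tool_call_latency_ms and tool_success_bool fields. The helper timed_tool_call and the flaky_lookup tool are hypothetical names invented for this example, and the W&B call is left as a comment so the snippet runs without an active run.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed_tool_call(
    tool_name: str, tool_fn: Callable[..., Any], *args: Any, **kwargs: Any
) -> Tuple[Any, Dict[str, Any]]:
    """Run a tool and return (result, metrics) without raising on failure."""
    start = time.perf_counter()
    try:
        result = tool_fn(*args, **kwargs)
        success = True
    except Exception:  # a failing tool is recorded, not propagated
        result = None
        success = False
    latency_ms = (time.perf_counter() - start) * 1000.0
    metrics = {
        "tool_name": tool_name,
        "tool_call_latency_ms": latency_ms,
        "tool_success_bool": int(success),  # 0/1 averages into a success rate
    }
    return result, metrics


def flaky_lookup(order_id: str) -> str:
    """Hypothetical tool used only for this example."""
    if not order_id:
        raise ValueError("missing order id")
    return f"order {order_id}: shipped"


ok_result, ok_metrics = timed_tool_call("order_lookup", flaky_lookup, "A17")
bad_result, bad_metrics = timed_tool_call("order_lookup", flaky_lookup, "")
# With an active run, each metrics dict could be sent via wandb.log(ok_metrics)
```

Logging success as 0 or 1 rather than a boolean string means a plain average over the field gives the success rate directly.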

Rolling out custom charts follows a governance workflow. First, define the key business metrics with product and finance stakeholders. Next, instrument the application code, often using LangChain callbacks or middleware. Then, build the charts in a shared W&B project with appropriate RBAC, treating them as versioned assets. Finally, route alerts on these metrics to your on-call system (e.g., PagerDuty) for anomalies like a 50% spike in token usage. This extends W&B from an experiment tracker into a system of record for AI operations, providing the granular visibility needed to scale LLM applications confidently and cost-effectively.

ARCHITECTURE SURFACES

Where Custom Charts Plug Into the W&B and LLM Stack

Logging Custom Metrics from LLM Inference

Custom charts are most valuable when visualizing metrics specific to agentic or RAG workflows, which aren't captured by default. Integrate the W&B SDK into your LangChain or custom application code to log these metrics at runtime.

Key Integration Points:

  • Agent Tool Calls: Log success/failure rates, latency, and cost per tool (e.g., database query, API call).
  • RAG Retrieval: Track chunk relevance scores, retrieval latency, and hit/miss rates for your vector store queries.
  • Conversation Analysis: Log per-turn metrics like sentiment drift, token usage, or custom safety scores.

This data flows into W&B runs, where custom charts on the run page provide immediate, experiment-level visibility for developers debugging specific agent behaviors.
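For the RAG retrieval integration point listed above, one possible approach is to collapse each retrieval call into a few scalar metrics before logging. The sketch below is only an illustration: the function name summarize_retrieval, the retrieval/ key prefix, and the 0.75 hit threshold are all assumptions, not values from W&B or any vector store.

```python
from statistics import mean
from typing import Dict, List


def summarize_retrieval(
    relevance_scores: List[float], latency_ms: float, hit_threshold: float = 0.75
) -> Dict[str, float]:
    """Collapse one retrieval call into scalar metrics suitable for charting."""
    n = len(relevance_scores)
    # A chunk counts as a "hit" when its score meets the assumed threshold.
    hits = sum(1 for score in relevance_scores if score >= hit_threshold)
    return {
        "retrieval/latency_ms": latency_ms,
        "retrieval/num_chunks": float(n),
        "retrieval/mean_relevance": mean(relevance_scores) if n else 0.0,
        "retrieval/hit_rate": hits / n if n else 0.0,
    }


payload = summarize_retrieval([0.9, 0.8, 0.5, 0.4], latency_ms=42.0)
# With an active run, this could be sent via wandb.log(payload)
```

An empty result list yields zeros instead of raising, so a retrieval miss still produces a data point on the chart.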

W&B CUSTOM VISUALIZATION

High-Value Custom Chart Use Cases for LLM Governance

Custom charts in Weights & Biases transform raw LLM telemetry into actionable operational dashboards. These visualizations enable teams to monitor unique business metrics, detect subtle performance issues, and govern complex AI workflows beyond standard accuracy and latency.

01

Token Cost Attribution by Conversation Turn

Track cumulative API costs across multi-turn agent sessions. Visualize which conversation stages (e.g., initial query, tool call, follow-up) consume the most tokens, enabling optimization of prompt design and agent reasoning steps to reduce spend without impacting quality.

Batch -> Real-time
Cost visibility
02

Tool Call Success & Error Rate Trends

Monitor the reliability of external API integrations used by LLM agents. Chart success rates, error types (timeout, auth, malformed request), and latency for each tool (e.g., Salesforce API, database query). Identify brittle dependencies for engineering prioritization.

1 sprint
MTTR reduction
03

Retrieval Relevance Score Distribution

Analyze the quality of document chunks retrieved by RAG systems. Plot the distribution of similarity scores between user queries and retrieved content. Spot degradation in embedding performance or chunking strategy, triggering re-indexing workflows.

Hours -> Minutes
Issue detection
04

User Feedback Sentiment vs. Model Confidence

Correlate LLM confidence scores (logprobs) with post-interaction user feedback (thumbs up/down). Visualize discrepancies to identify overconfident but incorrect answers or underconfident high-quality responses, guiding prompt calibration and guardrail adjustments.

05

PII Detection & Redaction Compliance Tracking

Govern data privacy by charting the volume and types of Personally Identifiable Information detected in LLM inputs/outputs. Monitor redaction effectiveness over time and segment by model or user group to ensure compliance with internal policies and regulations like GDPR.

Same day
Audit readiness
06

Multi-Model Performance & Cost Comparison

Build a custom radar or parallel coordinates chart comparing multiple LLM providers (OpenAI GPT-4, Anthropic Claude, Cohere) across dimensions: cost per 1k tokens, task-specific accuracy, latency, and hallucination rate. Use for runtime routing decisions and vendor strategy.

Batch -> Real-time
Routing logic
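To make use case 04 (user feedback versus model confidence) more concrete, the sketch below turns per-token log-probabilities into a single confidence value and flags responses that looked confident but received a thumbs-down. Using exp of the mean log-probability is one common convention, not the only one; the function names, metric keys, and the 0.8 cutoff are assumptions for illustration.

```python
import math
from typing import Dict, List


def mean_token_confidence(token_logprobs: List[float]) -> float:
    """Geometric-mean token probability: exp(mean of the log-probabilities)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def confidence_feedback_record(
    token_logprobs: List[float], thumbs_up: bool, high_conf: float = 0.8
) -> Dict[str, float]:
    """Build one chartable record pairing model confidence with user feedback."""
    confidence = mean_token_confidence(token_logprobs)
    return {
        "response/confidence": confidence,
        "response/thumbs_up": int(thumbs_up),
        # 1 when the model looked confident but the user rejected the answer
        "response/overconfident_miss": int(confidence >= high_conf and not thumbs_up),
    }


record = confidence_feedback_record([math.log(0.9), math.log(0.9)], thumbs_up=False)
# With an active run, this could be sent via wandb.log(record)
```

Plotting response/confidence against response/thumbs_up as a scatter, or averaging response/overconfident_miss over time, would surface the discrepancies described in that use case.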
MONITORING UNIQUE LLM METRICS

Example Workflows: From Data to Custom Dashboard

These workflows demonstrate how to instrument LLM applications to log custom metrics, visualize them in Weights & Biases custom charts, and trigger operational alerts. Each example connects a specific LLM use case to a W&B panel for engineering and product oversight.

Trigger: A user message is processed by a conversational agent (e.g., a customer support bot).

Context/Data Pulled: The application logs the following per turn:

  • session_id
  • user_turn_count
  • model_used (e.g., gpt-4-turbo, claude-3-sonnet)
  • prompt_tokens
  • completion_tokens
  • total_tokens
  • estimated_cost (calculated using provider's per-token pricing)

Model/Agent Action: The LLM generates a response. The application's custom callback handler or wrapper captures the token counts from the provider's API response.
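The estimated_cost field listed above can be derived from the captured token counts and a per-token price table. The following is a minimal sketch under stated assumptions: the model names and prices are placeholders rather than real provider pricing, and estimate_cost_usd and build_turn_record are hypothetical helpers.

```python
from typing import Any, Dict

# Placeholder prices in USD per 1,000 tokens; NOT real provider pricing.
PRICE_PER_1K_USD: Dict[str, Dict[str, float]] = {
    "example-model-a": {"prompt": 0.01, "completion": 0.03},
    "example-model-b": {"prompt": 0.003, "completion": 0.015},
}


def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Price prompt and completion tokens separately, since rates usually differ."""
    price = PRICE_PER_1K_USD[model]
    return (
        prompt_tokens / 1000 * price["prompt"]
        + completion_tokens / 1000 * price["completion"]
    )


def build_turn_record(
    session_id: str,
    user_turn_count: int,
    model_used: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> Dict[str, Any]:
    """Assemble the per-turn fields described in the workflow above."""
    return {
        "session_id": session_id,
        "user_turn_count": user_turn_count,
        "model_used": model_used,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "estimated_cost": estimate_cost_usd(model_used, prompt_tokens, completion_tokens),
    }


turn_record = build_turn_record("sess_1", 3, "example-model-a", 1000, 500)
```

Keeping the price table in one place makes it easier to update when provider pricing changes, and an unknown model name fails loudly with a KeyError instead of silently logging a zero cost.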

System Update: The metrics are sent to W&B as a custom run log using wandb.log().

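python
# Example log call within your application (assumes wandb.init() was called earlier).
# The turn count is logged as a metric rather than passed as step=, because
# W&B expects step to increase within a run and the turn count restarts
# with every new session.
wandb.log({
    "turn/index": user_turn_count,
    "turn/total_tokens": total_tokens,
    "turn/estimated_cost_usd": estimated_cost,
    "turn/model": model_used,
    "session_id": session_id
})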

Dashboard & Alert: Two pieces are configured in W&B:

  1. A custom Line Plot panel that displays average turn/total_tokens and turn/estimated_cost_usd over time, segmented by turn/model.
  2. An alert that sends a Slack notification if the 7-day rolling average cost per conversation exceeds a defined budget threshold.

Human Review Point: Anomalous spikes in token usage trigger a review of recent conversation logs to identify prompt inefficiencies or unexpected user behavior.
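The budget check in this workflow could be computed in application code before an alert is raised. The sketch below is one way to do it, assuming conversation costs are already grouped by calendar day; the function names, sample figures, and the 0.15 USD budget are invented for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def rolling_avg_cost_per_conversation(
    daily_costs: Dict[date, List[float]], as_of: date, window_days: int = 7
) -> Optional[float]:
    """Mean cost per conversation over the window ending on `as_of` (inclusive)."""
    window_start = as_of - timedelta(days=window_days - 1)
    costs = [
        cost
        for day, day_costs in daily_costs.items()
        if window_start <= day <= as_of
        for cost in day_costs
    ]
    return sum(costs) / len(costs) if costs else None


def over_budget(avg_cost: Optional[float], budget_usd: float) -> bool:
    """True only when there is data and the average exceeds the budget."""
    return avg_cost is not None and avg_cost > budget_usd


# Hypothetical per-conversation costs in USD, keyed by day.
sample_costs = {
    date(2023, 12, 20): [5.00],        # outside the 7-day window
    date(2024, 1, 1): [0.10, 0.20],
    date(2024, 1, 5): [0.30],
}
avg = rolling_avg_cost_per_conversation(sample_costs, as_of=date(2024, 1, 7))
breach = over_budget(avg, budget_usd=0.15)
# With an active run, a breach could be reported via
# wandb.alert(title="Cost budget exceeded", text=f"7-day avg: {avg:.2f} USD")
```

Returning None for an empty window avoids treating "no conversations" as "zero cost", which would otherwise mask an outage.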

FROM LLM INFERENCE TO CUSTOM DASHBOARDS

Implementation Architecture: Data Flow, APIs, and Model Layer

A production-ready architecture for streaming custom LLM metrics from your application to Weights & Biases for visualization and alerting.

The integration connects your live LLM application to W&B's logging API (wandb.log) to stream custom metrics in near real time. For a conversational agent, you would instrument key points in the code, such as after each user turn, to capture raw values like tokens_used and tool_calls_attempted alongside derived metrics like tool_success_rate and cost_per_turn. This data is sent as a dictionary payload to a dedicated W&B run, often initialized at the start of a user session or batch job. The W&B SDK handles batching and network retries in the background, so telemetry capture doesn't block your primary application workflow.

Within the W&B interface, you use the Custom Charts panel builder to create visualizations from this logged data. For monitoring token usage, you might build a line chart grouping tokens_used by conversation_id and model_variant. For tool reliability, a bar chart could display tool_success_rate segmented by tool_name. These panels are then composed into a dedicated dashboard, providing a real-time operational view. Crucially, you can set alerts on these custom metrics (e.g., "alert if tool_success_rate < 95% over last 100 calls") that trigger via Slack, PagerDuty, or webhooks to your incident management system.
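The example condition above ("tool_success_rate < 95% over last 100 calls") implies a sliding window over recent calls. A minimal sketch of that window follows, assuming the check runs inside the application; the class name ToolSuccessWindow is hypothetical, and how the alert is delivered is left to the caller.

```python
from collections import deque
from typing import Deque, Optional


class ToolSuccessWindow:
    """Tracks the success rate over the most recent `window_size` tool calls."""

    def __init__(self, window_size: int = 100, min_rate: float = 0.95) -> None:
        # deque(maxlen=...) discards the oldest entry automatically.
        self.calls: Deque[bool] = deque(maxlen=window_size)
        self.min_rate = min_rate

    def record(self, success: bool) -> None:
        self.calls.append(success)

    @property
    def success_rate(self) -> Optional[float]:
        if not self.calls:
            return None
        return sum(self.calls) / len(self.calls)

    def should_alert(self) -> bool:
        # Only evaluate once the window is full, to avoid noisy early alerts.
        rate = self.success_rate
        full = len(self.calls) == self.calls.maxlen
        return full and rate is not None and rate < self.min_rate


window = ToolSuccessWindow(window_size=100, min_rate=0.95)
for _ in range(96):
    window.record(True)
for _ in range(4):
    window.record(False)
alert_at_96_percent = window.should_alert()   # 96/100 succeeded

window.record(False)
window.record(False)                          # two oldest successes drop out
alert_at_94_percent = window.should_alert()   # 94/100 succeeded
```

Waiting for a full window is a design choice: it trades a slower first alert for fewer false alarms right after a deployment or restart.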

Governance is enforced at the data layer. Before logging, your application should strip any sensitive data (PII, PHI) from the metric payloads. Access to the W&B project and dashboard is controlled via W&B's RBAC and SSO integration, ensuring only authorized MLOps and engineering team members can view detailed operational data. For audit purposes, the entire metric lineage—from the application code commit that generated it to the specific W&B run and panel where it's displayed—is preserved, which is critical for debugging and compliance reviews. This architecture turns opaque LLM operations into a governed, observable system, enabling teams to optimize costs, reliability, and user experience based on empirical data.
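Stripping sensitive data before logging, as described above, could look roughly like the sketch below. The two regular expressions are deliberately simple illustrations for email addresses and phone-like numbers; they are assumptions for this example and fall well short of a complete PII or PHI detector.

```python
import re
from typing import Any, Dict

# Illustrative patterns only; real deployments typically need a dedicated PII tool.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_text(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def scrub_payload(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with string values redacted; other values pass through."""
    return {
        key: redact_text(value) if isinstance(value, str) else value
        for key, value in payload.items()
    }


raw = {
    "last_user_message": "Reach me at jane@example.com or +1 415 555 0100",
    "total_tokens": 205,
}
clean = scrub_payload(raw)
# With an active run, only the scrubbed copy would be sent: wandb.log(clean)
```

Because scrub_payload returns a new dictionary, the original payload stays available to the application while only the redacted copy leaves the process.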

WEIGHTS & BIASES INTEGRATION

Code Examples: Logging Custom Metrics and Building Panels

Tracking Token Usage Per Conversation Turn

To monitor cost and efficiency, you can log custom metrics for each turn in a multi-turn LLM conversation. This example logs the total tokens used and the conversation turn number for each interaction, enabling analysis of cost escalation in long-running sessions.

python
import wandb
from openai import OpenAI

# Initialize W&B run
run = wandb.init(project="llm-conversation-monitoring", job_type="inference")

client = OpenAI()
conversation_history = []
cumulative_tokens = 0  # running total across all turns in this session

for turn in range(1, 6):  # Simulating a 5-turn conversation
    user_message = f"User query for turn {turn}"
    conversation_history.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="gpt-4",
        messages=conversation_history,
        max_tokens=150
    )

    assistant_reply = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": assistant_reply})

    # Update the running total before logging
    cumulative_tokens += response.usage.total_tokens

    # Log custom metrics for this turn
    run.log({
        "conversation_turn": turn,
        "total_tokens": response.usage.total_tokens,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "cumulative_tokens": cumulative_tokens
    })

run.finish()

This creates a time-series plot in W&B showing token consumption growth, helping identify conversations that become inefficient.

MONITORING CUSTOM LLM METRICS

Operational Impact: Time Saved and Risk Reduced

How building custom W&B dashboards for unique LLM metrics accelerates troubleshooting and reduces operational risk.

| Metric | Before AI | After AI | Notes |
| --- | --- | --- | --- |
| Time to diagnose retrieval failure | Hours of log analysis | Minutes via dashboard alerts | Custom chart tracks tool call success rates and chunk relevance scores |
| Model cost anomaly detection | Monthly invoice review | Real-time spend dashboards | Custom panel visualizes token usage per user or conversation turn |
| Performance regression review | Ad-hoc data pulls and manual comparison | Automated A/B test dashboards | Statistical comparison of new vs. baseline model across custom metrics |
| Identifying biased output segments | Manual sampling and review | Segmented analysis by user cohort | Custom chart slices sentiment or toxicity scores by demographic metadata |
| Prompt template effectiveness | Qualitative user feedback | Quantified prompt version performance | Dashboard tracks business KPIs (e.g., conversion) linked to prompt hash |
| Data drift impact assessment | Reactive investigation after user reports | Proactive correlation of drift to KPIs | Custom chart overlays input distribution shifts with accuracy metrics |
| Stakeholder reporting on AI health | Manual slide deck creation | Automated, shareable report generation | Executive dashboard consolidates custom health scores from multiple charts |

CONTROLLED VISUALIZATION DEPLOYMENT

Governance and Phased Rollout Strategy

A structured approach to deploying and governing custom W&B charts ensures your LLM monitoring delivers actionable insights without creating dashboard sprawl or compliance gaps.

Begin by integrating the W&B SDK into your core LLM inference and evaluation pipelines to log custom metrics like token_usage_per_turn, tool_call_success_rate, or sentiment_score. Treat these custom panels as versioned assets—store their configuration (queries, aggregations, chart types) in Git alongside your prompt templates and evaluation code. This allows you to track changes, roll back problematic visualizations, and maintain a clear lineage between a metric's definition and the data it displays.
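One way to treat panel definitions as versioned assets is to keep a small, team-defined description of each chart in Git and fingerprint it. The PanelSpec schema below is entirely hypothetical and is not a W&B export format; it only sketches how a stable, diff-friendly representation might work.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class PanelSpec:
    """Hypothetical, team-defined description of one chart (not a W&B format)."""
    name: str
    chart_type: str
    metric: str
    aggregation: str
    group_by: List[str] = field(default_factory=list)
    owner: str = "unassigned"


def to_canonical_json(spec: PanelSpec) -> str:
    # Sorted keys and fixed indentation keep Git diffs stable across edits.
    return json.dumps(asdict(spec), sort_keys=True, indent=2)


def spec_fingerprint(spec: PanelSpec) -> str:
    """Short content hash that changes whenever the definition changes."""
    digest = hashlib.sha256(to_canonical_json(spec).encode("utf-8")).hexdigest()
    return digest[:12]


tokens_panel = PanelSpec(
    name="Token usage per turn",
    chart_type="line",
    metric="token_usage_per_turn",
    aggregation="mean",
    group_by=["model_variant"],
    owner="ml-platform",
)
revised_panel = PanelSpec(
    name="Token usage per turn",
    chart_type="line",
    metric="token_usage_per_turn",
    aggregation="p95",  # changed aggregation yields a different fingerprint
    group_by=["model_variant"],
    owner="ml-platform",
)
```

Recording the fingerprint next to a dashboard screenshot or report would let reviewers confirm which definition produced a given chart.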

Adopt a phased rollout: first, deploy charts to a development project in W&B for data science and engineering teams to validate metric accuracy and query performance. Next, promote a curated set to a staging project shared with product and operations stakeholders to align on KPIs and alert thresholds. Finally, lock down the production project with strict RBAC, ensuring only authorized users can modify core dashboards while view-only access is granted to broader business teams. Use W&B's reporting features to automate snapshot distribution for executive reviews.

Govern these visualizations by linking them to specific business objectives and LLM use cases. For example, a chart tracking hallucination_rate_by_document_source should have a defined owner, a documented review process for spikes, and a clear integration with your incident management system (e.g., PagerDuty). Implement data retention policies within W&B to automatically archive old runs, controlling costs and ensuring compliance. This structured approach transforms custom charts from ad-hoc analysis tools into governed components of your AI operations, providing reliable visibility for model performance, cost control, and regulatory reporting.

W&B CUSTOM CHARTS

FAQ: Technical and Commercial Questions

Building custom visualization panels in Weights & Biases for unique LLM metrics requires careful planning around data collection, chart logic, and operational integration. Below are answers to common technical and commercial questions from teams implementing this capability.

To build meaningful custom charts, you must instrument your application to log specific, structured data points to W&B. This typically involves extending your existing logging calls.

Core Data Points to Log:

  • Conversation/Session ID: A unique identifier to group related turns.
  • Turn/Timestamp: The sequence number or timestamp of each user/assistant exchange.
  • Token Usage: Breakdown of prompt, completion, and total tokens per turn, per model provider.
  • Tool Call Data: For agentic workflows, log the tool name, input parameters, success/failure status, execution duration, and any error messages.
  • Custom Metrics: Pre-calculated scores like sentiment (using a lightweight model), relevance score (from a retrieval step), or a binary flag for a business outcome (e.g., resolved: true).
  • Metadata: User cohort, model variant, prompt version, and deployment environment.

Example W&B Log Call (Python SDK):

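python
import wandb

# Start a run before logging; the project name here is only an example
wandb.init(project="llm-conversation-monitoring")

# Log a single conversation turn, using flat keys so each field
# can be charted or grouped on its own
wandb.log({
    "conversation_id": "conv_abc123",
    "turn": 5,
    "prompt_tokens": 120,
    "completion_tokens": 85,
    "total_tokens": 205,
    "tool_call/name": "get_weather",
    "tool_call/success": 1,
    "tool_call/duration_ms": 450,
    "calculated_sentiment": 0.8,
    "user_tier": "enterprise",
    "model": "gpt-4-turbo"
})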

Without this granular, per-event logging, custom charts will lack the necessary data dimensions for useful analysis.
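A lightweight check before each log call can catch payloads that are missing these dimensions. The sketch below is an assumption-laden example: the required-key list mirrors the core data points above, validate_turn_payload is a hypothetical helper, and the rule that total_tokens equals prompt plus completion tokens may not hold for every provider.

```python
from typing import Any, Dict, List

# Keys treated as mandatory in this sketch; adjust to your own schema.
REQUIRED_KEYS = (
    "conversation_id",
    "turn",
    "prompt_tokens",
    "completion_tokens",
    "total_tokens",
)


def validate_turn_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in payload]

    token_keys = ("prompt_tokens", "completion_tokens", "total_tokens")
    if all(key in payload for key in token_keys):
        expected = payload["prompt_tokens"] + payload["completion_tokens"]
        if payload["total_tokens"] != expected:
            problems.append("total_tokens != prompt_tokens + completion_tokens")

    # Nested lists or dicts are flagged because they are hard to chart directly.
    for key, value in payload.items():
        if not isinstance(value, (str, int, float, bool)):
            problems.append(f"non-scalar value for {key}")
    return problems


good = {
    "conversation_id": "conv_abc123",
    "turn": 5,
    "prompt_tokens": 120,
    "completion_tokens": 85,
    "total_tokens": 205,
}
bad = {"conversation_id": "conv_abc123", "turn": 5, "tool_calls": [{"name": "get_weather"}]}

good_problems = validate_turn_payload(good)
bad_problems = validate_turn_payload(bad)
```

Running this check in tests or in a staging environment, rather than on every production request, keeps the overhead low while still catching schema drift.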

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.