AI Integration with Weights and Biases Custom Charts

Why Custom W&B Charts Are Critical for LLM Operations
Standard LLM metrics don't capture the operational and financial realities of running agents and RAG in production.
Out-of-the-box dashboards in Weights & Biases track generic ML metrics like loss and accuracy, but production LLM applications demand business-aware visualizations. Teams need to monitor token consumption per conversation turn to correlate cost spikes with specific user intents or agent tool calls. They need to track tool call success rates to identify flaky APIs that degrade user experience, and visualize sentiment trends in generated content to preempt brand safety issues. Without custom charts, you're flying blind on what matters: unit economics, workflow reliability, and output quality.
Building these panels requires instrumenting your LangChain or custom application to log structured events with wandb.log(). For example, each agent step should emit custom metrics such as tool_call_latency_ms and tool_success_bool. A RAG retrieval event should log retrieved_chunk_relevance_score and context_token_count. These events become dimensions you can slice in W&B's custom chart builder to create operational dashboards that answer specific questions: "Which customer segment is driving the highest cost per resolved ticket?" or "Is our new embedding model improving retrieval accuracy for technical support queries?"
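As a minimal sketch of this instrumentation, the helper below times one tool call and builds the two agent-step metrics named above. The helper name timed_tool_call and the lookup_order tool are hypothetical, and the sketch assumes the resulting metrics dictionary would be passed to wandb.log() in an application that has already called wandb.init().

```python
import time
from typing import Any, Callable, Dict


def timed_tool_call(tool: Callable[..., Any], *args: Any, **kwargs: Any) -> Dict[str, Any]:
    """Run one tool call and return its result plus per-step metrics."""
    start = time.perf_counter()
    try:
        result = tool(*args, **kwargs)
        success = True
    except Exception:
        # The failure is recorded as a metric instead of crashing the agent step.
        result = None
        success = False
    latency_ms = (time.perf_counter() - start) * 1000.0
    metrics = {
        "tool_call_latency_ms": latency_ms,
        "tool_success_bool": int(success),  # 0/1 so the chart can average it
    }
    return {"result": result, "metrics": metrics}


def lookup_order(order_id: str) -> str:
    """Hypothetical tool used only for this illustration."""
    if not order_id:
        raise ValueError("missing order id")
    return f"order {order_id}: shipped"


ok = timed_tool_call(lookup_order, "A-17")
failed = timed_tool_call(lookup_order, "")
# In a live run: wandb.log(ok["metrics"])
```

Logging success as 0 or 1 rather than a boolean keeps the metric easy to aggregate into a success rate.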
Rolling out custom charts follows a governance workflow. First, define the key business metrics with product and finance stakeholders. Next, instrument the application code, often using LangChain callbacks or middleware. Then, build the charts in a shared W&B project with appropriate RBAC, treating them as versioned assets. Finally, integrate alerts from these charts into your on-call system (e.g., PagerDuty) for anomalies like a 50% spike in token usage. This transforms W&B from an experiment tracker into the system of record for AI operations, providing the granular visibility needed to scale LLM applications confidently and cost-effectively.
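The 50% token spike mentioned above can be checked with a few lines of arithmetic before anything is sent to an on-call system. This is a sketch under assumptions: the function names are hypothetical, and the baseline and recent windows are whatever per-turn token counts your application chooses to keep.

```python
from statistics import mean
from typing import Sequence


def token_spike_ratio(baseline: Sequence[float], recent: Sequence[float]) -> float:
    """Relative change of the recent mean token usage against the baseline mean."""
    base = mean(baseline)
    if base == 0:
        return 0.0
    return (mean(recent) - base) / base


def should_page(baseline: Sequence[float], recent: Sequence[float], threshold: float = 0.5) -> bool:
    """True when recent usage is at least `threshold` (50% by default) above baseline."""
    return token_spike_ratio(baseline, recent) >= threshold


baseline_tokens = [1000, 1100, 900]   # illustrative values only
spiking_tokens = [1600, 1500, 1400]
normal_tokens = [1200, 1200, 1200]
```

When should_page returns True, the application could call wandb.alert(title=..., text=...) or post to its own incident webhook; which channel is used is a deployment choice.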
Where Custom Charts Plug Into the W&B and LLM Stack
Logging Custom Metrics from LLM Inference
Custom charts are most valuable when visualizing metrics specific to agentic or RAG workflows, which aren't captured by default. Integrate the W&B SDK into your LangChain or custom application code to log these metrics at runtime.
Key Integration Points:
- Agent Tool Calls: Log success/failure rates, latency, and cost per tool (e.g., database query, API call).
- RAG Retrieval: Track chunk relevance scores, retrieval latency, and hit/miss rates for your vector store queries.
- Conversation Analysis: Log per-turn metrics like sentiment drift, token usage, or custom safety scores.
This data flows into W&B runs, where custom charts on the run page provide immediate, experiment-level visibility for developers debugging specific agent behaviors.
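For the RAG retrieval integration point, a per-query summary might look like the sketch below. The metric names under the rag/ prefix and the 0.7 hit threshold are assumptions chosen for illustration, not W&B conventions.

```python
from typing import Dict, List


def retrieval_metrics(scores: List[float], latency_ms: float, hit_threshold: float = 0.7) -> Dict[str, float]:
    """Summarize one vector store query as flat, chartable scalars."""
    has_hit = any(score >= hit_threshold for score in scores)
    return {
        "rag/top_score": max(scores) if scores else 0.0,
        "rag/mean_score": sum(scores) / len(scores) if scores else 0.0,
        "rag/hit": int(has_hit),            # 1 if any chunk clears the threshold
        "rag/latency_ms": latency_ms,
        "rag/chunks_returned": len(scores),
    }


query_metrics = retrieval_metrics([0.82, 0.64, 0.41], latency_ms=35.0)
empty_metrics = retrieval_metrics([], latency_ms=12.0)
# In a live run: wandb.log(query_metrics)
```

Averaging rag/hit across many queries gives the hit rate, so a miss-rate chart needs no extra logging.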
High-Value Custom Chart Use Cases for LLM Governance
Custom charts in Weights & Biases transform raw LLM telemetry into actionable operational dashboards. These visualizations enable teams to monitor unique business metrics, detect subtle performance issues, and govern complex AI workflows beyond standard accuracy and latency.
Token Cost Attribution by Conversation Turn
Track cumulative API costs across multi-turn agent sessions. Visualize which conversation stages (e.g., initial query, tool call, follow-up) consume the most tokens, enabling optimization of prompt design and agent reasoning steps to reduce spend without impacting quality.
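One way to prepare this view is to aggregate token counts by conversation stage before charting. The sketch below assumes each logged event carries a stage label and a total_tokens count; both field names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def tokens_by_stage(events: List[dict]) -> Dict[str, dict]:
    """Total tokens per stage plus each stage's share of the session total."""
    totals: Dict[str, int] = defaultdict(int)
    for event in events:
        totals[event["stage"]] += event["total_tokens"]
    grand_total = sum(totals.values())
    return {
        stage: {"tokens": tokens, "share": tokens / grand_total if grand_total else 0.0}
        for stage, tokens in totals.items()
    }


session_events = [  # illustrative values only
    {"stage": "initial_query", "total_tokens": 300},
    {"stage": "tool_call", "total_tokens": 450},
    {"stage": "tool_call", "total_tokens": 50},
    {"stage": "follow_up", "total_tokens": 200},
]
breakdown = tokens_by_stage(session_events)
```

The result could be loaded into a wandb.Table and rendered with wandb.plot.bar, though a line chart of the per-turn values works as well.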
Tool Call Success & Error Rate Trends
Monitor the reliability of external API integrations used by LLM agents. Chart success rates, error types (timeout, auth, malformed request), and latency for each tool (e.g., Salesforce API, database query). Identify brittle dependencies for engineering prioritization.
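To chart error types per tool, failures first need to be mapped onto a small set of categories. The mapping below from Python exception classes to the timeout, auth, and malformed request categories is an assumption for illustration; real integrations would more likely inspect HTTP status codes or SDK-specific errors. The tool names and stub functions are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple


def classify_tool_error(exc: Exception) -> str:
    """Map an exception to one of the error categories used on the chart."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, PermissionError):
        return "auth"
    if isinstance(exc, (ValueError, TypeError)):
        return "malformed_request"
    return "other"


def run_and_count(calls: List[Tuple[str, Callable[[], object]]]) -> Counter:
    """Execute each (tool_name, call) pair and count outcomes per tool."""
    counts: Counter = Counter()
    for tool_name, call in calls:
        try:
            call()
            counts[f"{tool_name}/success"] += 1
        except Exception as exc:
            counts[f"{tool_name}/error/{classify_tool_error(exc)}"] += 1
    return counts


def crm_ok() -> dict:
    return {"id": 1}


def crm_timeout() -> dict:
    raise TimeoutError("upstream took too long")


def db_bad_query() -> dict:
    raise ValueError("malformed filter")


counts = run_and_count([("crm_api", crm_ok), ("crm_api", crm_timeout), ("db_query", db_bad_query)])
```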
Retrieval Relevance Score Distribution
Analyze the quality of document chunks retrieved by RAG systems. Plot the distribution of similarity scores between user queries and retrieved content. Spot degradation in embedding performance or chunking strategy, triggering re-indexing workflows.
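A distribution plot starts from binned scores. The sketch below assumes similarity scores normalized to the range 0 to 1 and uses five equal-width bins; both choices are illustrative.

```python
from typing import List


def score_histogram(scores: List[float], bins: int = 5) -> List[int]:
    """Count scores in equal-width bins over [0, 1]; out-of-range values are clipped."""
    counts = [0] * bins
    for score in scores:
        clipped = min(max(score, 0.0), 1.0)
        index = min(int(clipped * bins), bins - 1)  # a score of 1.0 falls in the last bin
        counts[index] += 1
    return counts


def low_score_fraction(scores: List[float], threshold: float = 0.5) -> float:
    """Fraction of retrieved chunks scoring below `threshold`."""
    if not scores:
        return 0.0
    return sum(1 for score in scores if score < threshold) / len(scores)


sample_scores = [0.05, 0.15, 0.35, 0.55, 0.95, 1.0]  # illustrative values only
histogram = score_histogram(sample_scores)
```

If you prefer W&B to do the binning, passing the raw scores as wandb.Histogram(sample_scores) inside a wandb.log() call is an alternative. A rising low_score_fraction over time is one possible trigger for re-indexing.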
User Feedback Sentiment vs. Model Confidence
Correlate LLM confidence scores (logprobs) with post-interaction user feedback (thumbs up/down). Visualize discrepancies to identify overconfident but incorrect answers or underconfident high-quality responses, guiding prompt calibration and guardrail adjustments.
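The sketch below shows one way to turn token logprobs into a single confidence value and label each interaction against user feedback. Using the exponential of the mean token logprob (the geometric mean token probability) is an assumption, as are the 0.8 and 0.5 cutoffs and the bucket names.

```python
import math
from typing import List


def confidence_from_logprobs(token_logprobs: List[float]) -> float:
    """Geometric mean token probability: exp of the average logprob."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def calibration_bucket(confidence: float, thumbs_up: bool, high: float = 0.8, low: float = 0.5) -> str:
    """Label a response by how its confidence compares with user feedback."""
    if confidence >= high and not thumbs_up:
        return "overconfident_incorrect"
    if confidence <= low and thumbs_up:
        return "underconfident_correct"
    return "consistent"


confident = confidence_from_logprobs([-0.1, -0.1, -0.1])
bucket = calibration_bucket(confident, thumbs_up=False)
```

Logging the bucket label alongside the numeric confidence lets a chart count discrepancies per prompt version or model.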
PII Detection & Redaction Compliance Tracking
Govern data privacy by charting the volume and types of Personally Identifiable Information detected in LLM inputs/outputs. Monitor redaction effectiveness over time and segment by model or user group to ensure compliance with internal policies and regulations like GDPR.
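As a rough illustration of the counts such a chart would plot, the sketch below detects two PII types with regular expressions and redacts them. The patterns are deliberately simple assumptions and would miss many real-world formats; production systems typically rely on a dedicated PII detection service.

```python
import re
from typing import Dict

# Simplified illustrative patterns, not a complete PII detector.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def pii_counts(text: str) -> Dict[str, int]:
    """Count matches per PII type, keyed for logging as scalar metrics."""
    return {f"pii/{name}_count": len(pattern.findall(text)) for name, pattern in PII_PATTERNS.items()}


def redact(text: str) -> str:
    """Replace each detected span with a type placeholder such as [EMAIL]."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{name.upper()}]", text)
    return text


message = "Reach me at jane@example.com or 555-123-4567."
detected = pii_counts(message)
redacted = redact(message)
residual = pii_counts(redacted)  # should be zero if redaction worked
```

Logging both detected and residual counts gives the chart a direct measure of redaction effectiveness. Only the counts should be logged, never the matched text.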
Multi-Model Performance & Cost Comparison
Build a custom radar or parallel coordinates chart comparing multiple LLM providers (OpenAI GPT-4, Anthropic Claude, Cohere) across dimensions: cost per 1k tokens, task-specific accuracy, latency, and hallucination rate. Use for runtime routing decisions and vendor strategy.
Example Workflows: From Data to Custom Dashboard
These workflows demonstrate how to instrument LLM applications to log custom metrics, visualize them in Weights & Biases custom charts, and trigger operational alerts. Each example connects a specific LLM use case to a W&B panel for engineering and product oversight.
Trigger: A user message is processed by a conversational agent (e.g., a customer support bot).
Context/Data Pulled: The application logs the following per turn:
- session_id
- user_turn_count
- model_used (e.g., gpt-4-turbo, claude-3-sonnet)
- prompt_tokens
- completion_tokens
- total_tokens
- estimated_cost (calculated using the provider's per-token pricing)
Model/Agent Action: The LLM generates a response. The application's custom callback handler or wrapper captures the token counts from the provider's API response.
System Update: The metrics are sent to W&B as a custom run log using wandb.log().
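```python
import wandb

# Example log call within your application.
# Assumes wandb.init(...) has been called and that total_tokens, estimated_cost,
# model_used, session_id, and user_turn_count hold the per-turn values listed above.
wandb.log({
    "turn/total_tokens": total_tokens,
    "turn/estimated_cost_usd": estimated_cost,
    "turn/model": model_used,
    "session_id": session_id
}, step=user_turn_count)
```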
Dashboard & Alert: A custom W&B Line Plot panel is configured to:
- Display average turn/total_tokens and turn/estimated_cost_usd over time, segmented by turn/model.
- Set a W&B alert to trigger a Slack notification if the 7-day rolling average cost per conversation exceeds a defined budget threshold.
Human Review Point: Anomalous spikes in token usage trigger a review of recent conversation logs to identify prompt inefficiencies or unexpected user behavior.
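The estimated_cost value in this workflow is derived from token counts and a price table. The sketch below shows the arithmetic; the model names and per-1k-token prices are placeholders, not real provider pricing, and should be replaced with the rates from your provider's current price list.

```python
from typing import Dict

# Placeholder prices in USD per 1,000 tokens. NOT real provider pricing.
PRICE_PER_1K: Dict[str, Dict[str, float]] = {
    "model-a": {"prompt": 0.01, "completion": 0.03},
    "model-b": {"prompt": 0.003, "completion": 0.015},
}


def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one turn given separate prompt and completion token prices."""
    price = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * price["prompt"] + (completion_tokens / 1000) * price["completion"]


def turn_payload(session_id: str, model: str, prompt_tokens: int, completion_tokens: int) -> dict:
    """Build the per-turn dictionary in the shape used by the log call above."""
    return {
        "turn/total_tokens": prompt_tokens + completion_tokens,
        "turn/estimated_cost_usd": estimate_cost_usd(model, prompt_tokens, completion_tokens),
        "turn/model": model,
        "session_id": session_id,
    }


payload = turn_payload("sess_001", "model-a", prompt_tokens=1000, completion_tokens=1000)
```

Pricing prompt and completion tokens separately matters because providers commonly charge different rates for each.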
Implementation Architecture: Data Flow, APIs, and Model Layer
A production-ready architecture for streaming custom LLM metrics from your application to Weights & Biases for visualization and alerting.
The integration connects your live LLM application to W&B's logging API (wandb.log) to stream custom metrics in real time. For a conversational agent, you would instrument key points in the code, such as after each user turn, to capture dimensions like tokens_used, tool_calls_attempted, tool_success_rate, and derived metrics like cost_per_turn. This data is sent as a dictionary payload to a dedicated W&B run, often initialized at the start of a user session or batch job. The W&B SDK handles batching and network retries, so telemetry capture doesn't block your primary application workflow.
Within the W&B interface, you use the Custom Charts panel builder to create visualizations from this logged data. For monitoring token usage, you might build a line chart grouping tokens_used by conversation_id and model_variant. For tool reliability, a bar chart could display tool_success_rate segmented by tool_name. These panels are then composed into a dedicated dashboard, providing a real-time operational view. Crucially, you can set alerts on these custom metrics (e.g., "alert if tool_success_rate < 95% over last 100 calls") that trigger via Slack, PagerDuty, or webhooks to your incident management system.
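The example rule "alert if tool_success_rate < 95% over last 100 calls" implies a rolling window that the application (or a downstream job) has to maintain. A minimal sketch follows; the class name is hypothetical, and it only reports a breach once the window is full so that a single early failure does not fire an alert.

```python
from collections import deque
from typing import Deque


class RollingSuccessRate:
    """Tracks tool call outcomes over a fixed-size window of recent calls."""

    def __init__(self, window: int = 100) -> None:
        self._outcomes: Deque[bool] = deque(maxlen=window)

    def record(self, success: bool) -> float:
        """Add one outcome and return the current success rate."""
        self._outcomes.append(success)
        return self.rate()

    def rate(self) -> float:
        if not self._outcomes:
            return 1.0
        return sum(self._outcomes) / len(self._outcomes)

    def breached(self, threshold: float = 0.95) -> bool:
        """True only when the window is full and the rate is below the threshold."""
        window_full = len(self._outcomes) == self._outcomes.maxlen
        return window_full and self.rate() < threshold


monitor = RollingSuccessRate(window=100)
for _ in range(96):
    monitor.record(True)
for _ in range(4):
    monitor.record(False)
rate_at_100 = monitor.rate()          # 96 successes out of 100
breached_at_100 = monitor.breached()
monitor.record(False)
monitor.record(False)                 # the two oldest successes drop out of the window
breached_after = monitor.breached()
```

The rate returned by record can be logged as tool_success_rate on every call, with the breach check deciding when to notify.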
Governance is enforced at the data layer. Before logging, your application should strip any sensitive data (PII, PHI) from the metric payloads. Access to the W&B project and dashboard is controlled via W&B's RBAC and SSO integration, ensuring only authorized MLOps and engineering team members can view detailed operational data. For audit purposes, the entire metric lineage, from the application code commit that generated it to the specific W&B run and panel where it's displayed, is preserved, which is critical for debugging and compliance reviews. This architecture turns opaque LLM operations into a governed, observable system, enabling teams to optimize costs, reliability, and user experience based on empirical data.
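One simple way to enforce the "strip sensitive data before logging" rule is an allowlist: only keys that have been explicitly approved ever reach the logging call. The sketch below assumes the metric names mentioned earlier in this section; the allowlist contents and function name are illustrative.

```python
from typing import Any, Dict

# Keys approved for logging. Anything else is dropped before it leaves the application.
ALLOWED_METRIC_KEYS = {
    "tokens_used",
    "tool_calls_attempted",
    "tool_success_rate",
    "cost_per_turn",
    "model_variant",
}


def scrub_payload(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the payload containing only allowlisted keys."""
    return {key: value for key, value in payload.items() if key in ALLOWED_METRIC_KEYS}


raw_event = {
    "tokens_used": 205,
    "cost_per_turn": 0.0123,
    "model_variant": "v2",
    "user_email": "jane@example.com",      # must never be logged
    "prompt_text": "Where is my order?",   # free text may contain PII
}
safe_event = scrub_payload(raw_event)
# In a live run: wandb.log(safe_event)
```

An allowlist fails closed: a newly added field stays out of W&B until someone deliberately approves it, which is usually easier to audit than a blocklist.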
Code Examples: Logging Custom Metrics and Building Panels
Tracking Token Usage Per Conversation Turn
To monitor cost and efficiency, you can log custom metrics for each turn in a multi-turn LLM conversation. This example logs the total tokens used and the conversation turn number for each interaction, enabling analysis of cost escalation in long-running sessions.
```python
import wandb
from openai import OpenAI

# Initialize the W&B run and the OpenAI client
# (the client reads OPENAI_API_KEY from the environment)
run = wandb.init(project="llm-conversation-monitoring", job_type="inference")
client = OpenAI()

conversation_history = []
cumulative_tokens = 0  # running total across all turns

for turn in range(1, 6):  # Simulating a 5-turn conversation
    user_message = f"User query for turn {turn}"
    conversation_history.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="gpt-4",
        messages=conversation_history,
        max_tokens=150,
    )
    assistant_reply = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": assistant_reply})

    # Log custom metrics for this turn
    cumulative_tokens += response.usage.total_tokens
    run.log({
        "conversation_turn": turn,
        "total_tokens": response.usage.total_tokens,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "cumulative_tokens": cumulative_tokens,
    })

run.finish()
```
This creates a time-series plot in W&B showing token consumption growth, helping identify conversations that become inefficient.
Operational Impact: Time Saved and Risk Reduced
How building custom W&B dashboards for unique LLM metrics accelerates troubleshooting and reduces operational risk.
| Metric | Before custom dashboards | With custom dashboards | Notes |
|---|---|---|---|
| Time to diagnose retrieval failure | Hours of log analysis | Minutes via dashboard alerts | Custom chart tracks tool call success rates and chunk relevance scores |
| Model cost anomaly detection | Monthly invoice review | Real-time spend dashboards | Custom panel visualizes token usage per user or conversation turn |
| Performance regression review | Ad-hoc data pulls and manual comparison | Automated A/B test dashboards | Statistical comparison of new vs. baseline model across custom metrics |
| Identifying biased output segments | Manual sampling and review | Segmented analysis by user cohort | Custom chart slices sentiment or toxicity scores by demographic metadata |
| Prompt template effectiveness | Qualitative user feedback | Quantified prompt version performance | Dashboard tracks business KPIs (e.g., conversion) linked to prompt hash |
| Data drift impact assessment | Reactive investigation after user reports | Proactive correlation of drift to KPIs | Custom chart overlays input distribution shifts with accuracy metrics |
| Stakeholder reporting on AI health | Manual slide deck creation | Automated, shareable report generation | Executive dashboard consolidates custom health scores from multiple charts |
Governance and Phased Rollout Strategy
A structured approach to deploying and governing custom W&B charts ensures your LLM monitoring delivers actionable insights without creating dashboard sprawl or compliance gaps.
Begin by integrating the W&B SDK into your core LLM inference and evaluation pipelines to log custom metrics like token_usage_per_turn, tool_call_success_rate, or sentiment_score. Treat these custom panels as versioned assets—store their configuration (queries, aggregations, chart types) in Git alongside your prompt templates and evaluation code. This allows you to track changes, roll back problematic visualizations, and maintain a clear lineage between a metric's definition and the data it displays.
Adopt a phased rollout: first, deploy charts to a development project in W&B for data science and engineering teams to validate metric accuracy and query performance. Next, promote a curated set to a staging project shared with product and operations stakeholders to align on KPIs and alert thresholds. Finally, lock down the production project with strict RBAC, ensuring only authorized users can modify core dashboards while view-only access is granted to broader business teams. Use W&B's reporting features to automate snapshot distribution for executive reviews.
Govern these visualizations by linking them to specific business objectives and LLM use cases. For example, a chart tracking hallucination_rate_by_document_source should have a defined owner, a documented review process for spikes, and a clear integration with your incident management system (e.g., PagerDuty). Implement data retention policies within W&B to automatically archive old runs, controlling costs and ensuring compliance. This structured approach transforms custom charts from ad-hoc analysis tools into governed components of your AI operations, providing reliable visibility for model performance, cost control, and regulatory reporting.
FAQ: Technical and Commercial Questions
Building custom visualization panels in Weights & Biases for unique LLM metrics requires careful planning around data collection, chart logic, and operational integration. Below are answers to common technical and commercial questions from teams implementing this capability.
To build meaningful custom charts, you must instrument your application to log specific, structured data points to W&B. This typically involves extending your existing logging calls.
Core Data Points to Log:
- Conversation/Session ID: A unique identifier to group related turns.
- Turn/Timestamp: The sequence number or timestamp of each user/assistant exchange.
- Token Usage: Breakdown of prompt, completion, and total tokens per turn, per model provider.
- Tool Call Data: For agentic workflows, log the tool name, input parameters, success/failure status, execution duration, and any error messages.
- Custom Metrics: Pre-calculated scores like sentiment (using a lightweight model), relevance score (from a retrieval step), or a binary flag for a business outcome (e.g., resolved: true).
- Metadata: User cohort, model variant, prompt version, and deployment environment.
Example W&B Log Call (Python SDK):
```python
import wandb

# Log a single conversation turn
# (assumes wandb.init(...) has already been called for this session)
wandb.log({
    "conversation_id": "conv_abc123",
    "turn": 5,
    "prompt_tokens": 120,
    "completion_tokens": 85,
    "total_tokens": 205,
    "tool_calls": [
        {"name": "get_weather", "success": True, "duration_ms": 450}
    ],
    "calculated_sentiment": 0.8,
    "user_tier": "enterprise",
    "model": "gpt-4-turbo"
})
```
Without this granular, per-event logging, custom charts will lack the necessary data dimensions for useful analysis.
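Scalar values are the easiest to plot, so a nested field such as tool_calls is often summarized into flat keys before charting. The sketch below shows one way to do that; the tool_calls/ key names are assumptions, and the nested list can still be logged alongside the summary if you want the raw detail.

```python
from typing import Any, Dict


def flatten_turn(event: Dict[str, Any]) -> Dict[str, Any]:
    """Replace the nested tool_calls list with scalar summary metrics."""
    flat = {key: value for key, value in event.items() if key != "tool_calls"}
    calls = event.get("tool_calls", [])
    flat["tool_calls/count"] = len(calls)
    if calls:  # rate and duration are only meaningful when at least one call was made
        flat["tool_calls/success_rate"] = sum(1 for call in calls if call["success"]) / len(calls)
        flat["tool_calls/total_duration_ms"] = sum(call["duration_ms"] for call in calls)
    return flat


turn_event = {
    "conversation_id": "conv_abc123",
    "turn": 5,
    "total_tokens": 205,
    "tool_calls": [
        {"name": "get_weather", "success": True, "duration_ms": 450},
        {"name": "get_weather", "success": False, "duration_ms": 150},
    ],
}
flat_event = flatten_turn(turn_event)
# In a live run: wandb.log(flat_event)
```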

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.