How to Implement Liquid Cooling in High-Density AI Data Centers

A technical guide to deploying direct-to-chip and immersion cooling for GPU racks. Learn to select vendor solutions, retrofit existing infrastructure, and integrate cooling with facility management to achieve optimal Power Usage Effectiveness (PUE).

This guide details the technical implementation of direct-to-chip and immersion liquid cooling for GPU racks powering large-scale model training. It compares vendor solutions from CoolIT, Asetek, and GRC, and provides a step-by-step plan for retrofitting existing infrastructure or designing new deployments. You will learn how to integrate liquid cooling with facility management systems to achieve optimal Power Usage Effectiveness (PUE).

Liquid cooling is the practical thermal management approach for high-density AI compute, where air cooling struggles to dissipate the 1 kW+ heat loads of modern GPUs. The two primary methods are direct-to-chip (D2C), which mounts cold plates on processors, and immersion cooling, which submerges entire servers in dielectric fluid. D2C offers a modular retrofit path, while immersion offers greater heat-transfer capacity for the highest power densities, enabling rack-level power draws exceeding 100 kW. This shift underpins sustainable AI infrastructure: it cuts the energy spent on cooling and makes heat reclamation practical.

Implementation requires a systematic approach: first, assess your facility's power, space, and water availability. Next, select a cooling architecture and vendor based on rack density and total thermal design power (TDP). For new builds, design for containment and integrate cooling control with your data center infrastructure management (DCIM) system. For retrofits, plan a phased deployment, starting with a proof-of-concept rack. Finally, instrument everything to monitor coolant temperature, flow rates, and pump power, linking this data to your overall Power Usage Effectiveness (PUE) calculations for continuous optimization.
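As a rough illustration of linking monitored power data to PUE, the sketch below computes an energy-weighted PUE from equally spaced power samples. The `PowerSample` fields and the sample values are hypothetical; a real deployment would pull these readings from its DCIM or metering system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PowerSample:
    """One reading taken at a fixed interval (all fields are illustrative)."""
    it_kw: float     # IT load: servers, GPUs, storage, network
    pump_kw: float   # liquid loop / CDU pump power
    other_kw: float  # remaining facility load: fans, chillers, lighting, losses


def pue(samples: List[PowerSample]) -> float:
    """PUE = total facility energy / IT energy.

    With equally spaced samples, the ratio of summed power equals
    the ratio of energy over the sampled period.
    """
    total_it = sum(s.it_kw for s in samples)
    if total_it <= 0:
        raise ValueError("IT load must be positive to compute PUE")
    total_facility = sum(s.it_kw + s.pump_kw + s.other_kw for s in samples)
    return total_facility / total_it


samples = [PowerSample(1000.0, 20.0, 80.0), PowerSample(800.0, 20.0, 60.0)]
print(f"PUE over period: {pue(samples):.2f}")
```

Tracking pump power as its own field makes it easier to see how much of the overhead comes from the liquid loop itself rather than from the remaining air-side equipment.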

IMPLEMENTATION GUIDE

Key Concepts: Liquid Cooling Architectures

Master the core architectures and vendor solutions for deploying liquid cooling in high-density AI clusters. This is the foundation for sustainable, high-performance infrastructure.

Direct-to-Chip (Cold Plate) Cooling

This architecture uses a cold plate mounted directly on high-heat components like CPUs and GPUs. A coolant (typically water) circulates through micro-channels in the plate, absorbing heat before being transported to a heat exchanger.

Key Advantage: High efficiency for concentrated heat sources; easier retrofit than immersion.
Implementation: Requires a facility-level distribution system built around a coolant distribution unit (CDU), plus careful leak-proof plumbing at the server level.
Vendor Examples: CoolIT Systems, Asetek.

Cooling Energy Savings: ~30%
Heat Removal per Chip: > 1 kW
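To make the cold plate heat balance concrete, here is a minimal sketch that estimates the coolant flow needed to remove a given heat load, using the standard relation Q = m_dot * c_p * delta_T. It assumes a water coolant (c_p of about 4186 J/(kg*K), density of about 1 kg/L) and a hypothetical 10 K temperature rise across the plate; vendor data sheets should supply the real design values.

```python
def required_flow_l_per_min(
    heat_w: float,
    delta_t_k: float,
    cp_j_per_kg_k: float = 4186.0,   # assumption: water
    density_kg_per_l: float = 1.0,   # assumption: water
) -> float:
    """Coolant flow needed to carry heat_w with a delta_t_k temperature rise."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_per_s = heat_w / (cp_j_per_kg_k * delta_t_k)
    return mass_flow_kg_per_s / density_kg_per_l * 60.0


# A 1 kW chip with an assumed 10 K coolant temperature rise.
print(f"{required_flow_l_per_min(1000.0, 10.0):.2f} L/min per cold plate")
```

Halving the allowed temperature rise doubles the required flow, which is one reason the loop temperature set points and the pump sizing have to be chosen together.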

Single-Phase Immersion Cooling

Servers are fully submerged in a dielectric fluid that does not conduct electricity. The fluid absorbs heat through direct contact and is circulated to a heat exchanger, where it rejects heat without changing phase.

Key Advantage: Eliminates fans and enables extreme power densities (> 50 kW per rack).
Consideration: Requires specialized tanks and fluid maintenance; server components must be compatible (e.g., no spinning HDDs).
Vendor Example: GRC (Green Revolution Cooling).

Achievable Efficiency: PUE < 1.03
Per Rack Density: 50 kW+

Two-Phase Immersion Cooling

Similar to single-phase, but the dielectric fluid boils upon contacting hot components. The vapor rises, condenses on a cooled coil, and returns as liquid. The phase change provides extremely efficient heat transfer.

Key Advantage: Highest thermal efficiency and uniform temperature control.
Consideration: More complex system design; fluid cost is higher. Best for the highest-density deployments.
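Vendor Example: 3M with Novec engineered fluids (3M has announced it will exit PFAS manufacturing, including Novec, by the end of 2025, so confirm fluid availability before committing to a design).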

Cooling Energy Reduction: ~90%
Per Rack Density: > 100 kW
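The phase-change mechanism can be sized with a simple latent heat balance: the vapor generation rate is the heat load divided by the fluid's latent heat of vaporization. The sketch below uses a placeholder latent heat of 100 kJ/kg, which is an assumption for illustration only; take the actual figure from the fluid's data sheet.

```python
def vapor_rate_kg_per_s(heat_kw: float, latent_heat_kj_per_kg: float) -> float:
    """Mass of fluid boiled per second to absorb heat_kw (Q = m_dot * h_fg)."""
    if latent_heat_kj_per_kg <= 0:
        raise ValueError("latent heat must be positive")
    return heat_kw / latent_heat_kj_per_kg


# Assumed values: a 100 kW tank and a placeholder latent heat of 100 kJ/kg.
rate = vapor_rate_kg_per_s(100.0, 100.0)
print(f"Condenser must return about {rate:.2f} kg/s of condensate")
```

The condenser coil has to condense vapor at least this fast, otherwise tank pressure and fluid loss rise.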

Coolant Distribution Unit (CDU)

The CDU is the heart of a liquid cooling system. It acts as the interface between the warm coolant returning from the IT equipment and the facility's cooling water (or dry cooler).

Core Functions: Pump control, flow regulation, temperature monitoring, and leak detection.
Integration: Must connect to the Building Management System (BMS) and Data Center Infrastructure Management (DCIM) software for holistic control.
Action: Select a CDU with redundant pumps and compatibility with your chosen cooling architecture.
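The core CDU functions listed above can be expressed as a small health check. This is a sketch only: the `CduReading` fields, the alarm names, and every threshold are hypothetical, and a real CDU exposes its own telemetry and alarm points through its management interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CduReading:
    """Hypothetical snapshot of CDU telemetry."""
    supply_temp_c: float
    flow_l_per_min: float
    leak_detected: bool
    pumps_running: int


def check_cdu(
    reading: CduReading,
    max_supply_temp_c: float = 45.0,    # placeholder limit
    min_flow_l_per_min: float = 100.0,  # placeholder limit
    redundant_pump_count: int = 2,      # pumps expected for N+1 operation
) -> List[str]:
    """Return alarm codes for any condition outside the configured limits."""
    alarms = []
    if reading.leak_detected:
        alarms.append("LEAK")
    if reading.supply_temp_c > max_supply_temp_c:
        alarms.append("HIGH_SUPPLY_TEMP")
    if reading.flow_l_per_min < min_flow_l_per_min:
        alarms.append("LOW_FLOW")
    if reading.pumps_running < redundant_pump_count:
        alarms.append("PUMP_REDUNDANCY_LOST")
    return alarms


print(check_cdu(CduReading(48.0, 90.0, False, 1)))
```

In practice the returned alarm codes would be forwarded to the BMS or DCIM system rather than printed.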

Facility Integration & Heat Rejection

Liquid cooling shifts the heat rejection problem from the server room to the facility perimeter. You must design the final step of heat rejection.

Options: Dry coolers (air-cooled), cooling towers (evaporative), or integration with a district heating system.
Key Metric: Maximize hours of free cooling by using ambient air or water to cool the loop without mechanical chillers.
Step: Conduct a climate analysis for your site to determine the optimal heat rejection technology.
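A first-pass climate analysis can be as simple as counting the hours in which ambient air is cold enough to cool the loop without chillers. The sketch below assumes a dry cooler that can deliver coolant at the ambient temperature plus a fixed approach; the set point, approach, and temperature series are illustrative values, not site data.

```python
from typing import Iterable


def free_cooling_hours(
    hourly_ambient_c: Iterable[float],
    loop_supply_setpoint_c: float,
    approach_k: float,
) -> int:
    """Count hours where ambient + approach meets the loop supply set point."""
    ambient_limit_c = loop_supply_setpoint_c - approach_k
    return sum(1 for t in hourly_ambient_c if t <= ambient_limit_c)


# Four sample hours; a real analysis would use a full year (8,760 hours).
hours = free_cooling_hours([10.0, 20.0, 28.0, 35.0],
                           loop_supply_setpoint_c=32.0, approach_k=5.0)
print(f"Free-cooling hours: {hours}")
```

Because liquid loops tolerate warmer supply temperatures than room air does, raising the set point is usually the most direct lever for increasing this count.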

Retrofit vs. Greenfield Deployment

Your implementation path is dictated by existing infrastructure.

Retrofit: Involves fitting cold plates to servers in existing racks or replacing racks with immersion tanks. Requires assessment of floor load, power distribution, and rack space. A phased approach is critical.
Greenfield: Design the data center around liquid cooling from the start. This allows for optimal rack layout, piping, and heat rejection design, achieving the lowest possible PUE.
Next Step: Review our guide on How to Launch a Liquid Cooling Retrofit for Existing AI Infrastructure.
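For the floor load part of a retrofit assessment, a quick check is to compare the loaded weight of an immersion tank against the floor rating. All figures below (tank and server mass, fluid volume and density, footprint, floor rating) are hypothetical placeholders; a structural engineer should confirm real limits.

```python
def floor_load_kg_per_m2(
    equipment_kg: float,
    fluid_litres: float,
    fluid_density_kg_per_l: float,
    footprint_m2: float,
) -> float:
    """Distributed load of a filled tank over its footprint."""
    if footprint_m2 <= 0:
        raise ValueError("footprint_m2 must be positive")
    fluid_kg = fluid_litres * fluid_density_kg_per_l
    return (equipment_kg + fluid_kg) / footprint_m2


# Placeholder tank: 800 kg of tank and servers, 1000 L of fluid at 0.8 kg/L.
load = floor_load_kg_per_m2(800.0, 1000.0, 0.8, 2.0)
floor_rating = 1200.0  # kg/m^2, assumed rating for illustration
print(f"Load {load:.0f} kg/m^2, within rating: {load <= floor_rating}")
```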

DIRECT-TO-CHIP & IMMERSION SOLUTIONS

Vendor Comparison: CoolIT, Asetek, GRC

A technical comparison of leading liquid cooling vendors for retrofitting or deploying new high-density AI GPU racks. This table evaluates key factors for integration, performance, and operational sustainability.

Feature / Metric	CoolIT Systems	Asetek	GRC (Green Revolution Cooling)
Primary Cooling Method	Direct-to-Chip (Cold Plate)	Direct-to-Chip (Cold Plate)	Single-Phase Immersion
Coolant Type	Dielectric fluid or water	Dielectric fluid or water	Dielectric fluid (e.g., ElectroSafe)
Max Heat Density Supported	50 kW per rack	45 kW per rack	150 kW per rack
Retrofit Kit Availability
Achievable PUE	< 1.10	< 1.10	< 1.03
Integration with DCIM/BMS	API & SNMP support	RackCDU management software	Cerebra AI management platform
Waste Heat Reclamation Ready	60°C output	60°C output	70°C output
Typical Fluid Maintenance Interval	5 years	5 years	10+ years
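As a sketch of how the table can feed a first-pass vendor shortlist, the snippet below filters on maximum supported rack density. The density figures are copied from the table above and have not been independently verified; the `headroom` factor is a hypothetical planning margin.

```python
from typing import List

# Values copied from the comparison table (not independently verified).
VENDORS = [
    {"name": "CoolIT Systems", "method": "direct-to-chip", "max_rack_kw": 50},
    {"name": "Asetek", "method": "direct-to-chip", "max_rack_kw": 45},
    {"name": "GRC", "method": "single-phase immersion", "max_rack_kw": 150},
]


def shortlist(rack_kw: float, headroom: float = 1.0) -> List[str]:
    """Vendors whose listed max density covers rack_kw times the headroom."""
    required_kw = rack_kw * headroom
    return [v["name"] for v in VENDORS if v["max_rack_kw"] >= required_kw]


print(shortlist(48))   # planned rack density in kW
print(shortlist(120))
```

Density is only one filter; retrofit constraints, DCIM integration, and fluid maintenance would narrow the list further.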

FOUNDATIONAL ANALYSIS

Step 1: Assess Your Infrastructure and Workloads

Before selecting a cooling technology, you must conduct a rigorous technical and financial assessment of your existing environment and projected AI workloads. This step defines the constraints and requirements for your entire liquid cooling implementation.

Begin by profiling your computational density. Sum the thermal design power (TDP) per rack, focusing on GPU models and their peak heat output, and validate the total against measured power draw where you can. Simultaneously, audit facility constraints: power distribution unit (PDU) capacity, floor load limits, and any existing coolant distribution unit (CDU) infrastructure. This quantifies the gap between your current air-cooling capacity and the demands of high-density AI training, establishing the baseline for your Power Usage Effectiveness (PUE) improvement target.
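The gap between rack heat load and air-cooling capacity can be estimated with simple arithmetic. In this sketch the GPU count, per-GPU TDP, non-GPU load, and air-cooling capacity are all hypothetical inputs chosen for illustration.

```python
def rack_heat_kw(gpu_count: int, gpu_tdp_w: float, other_w: float = 0.0) -> float:
    """Approximate rack heat load from GPU TDP plus other components."""
    return (gpu_count * gpu_tdp_w + other_w) / 1000.0


def cooling_gap_kw(rack_kw: float, air_capacity_kw: float) -> float:
    """Heat that air cooling cannot remove and liquid cooling must capture."""
    return max(0.0, rack_kw - air_capacity_kw)


# Assumed rack: 32 GPUs at 1000 W each plus 8 kW of CPUs, memory, and network.
rack_kw = rack_heat_kw(32, 1000.0, other_w=8000.0)
gap_kw = cooling_gap_kw(rack_kw, air_capacity_kw=15.0)
print(f"Rack load {rack_kw:.0f} kW, gap beyond air cooling {gap_kw:.0f} kW")
```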

Next, analyze workload patterns. Characterize jobs by their duration, power consistency, and scheduling predictability. Long-running, steady-state training jobs are ideal for direct-to-chip or immersion cooling, while bursty inference workloads may suit a hybrid approach. This analysis directly informs the business case, weighing the capital expenditure of retrofitting against the operational savings from reduced energy and water use. A clear assessment prevents costly over-engineering or under-provisioning.
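One way to characterize power consistency is the coefficient of variation (standard deviation divided by mean) of a job's power trace. The sketch below applies that measure with a 0.15 threshold, which is an arbitrary illustrative cutoff rather than an established standard.

```python
from statistics import mean, pstdev
from typing import Sequence


def power_variability(power_kw: Sequence[float]) -> float:
    """Coefficient of variation of a power trace (0 means perfectly steady)."""
    avg = mean(power_kw)
    if avg <= 0:
        raise ValueError("mean power must be positive")
    return pstdev(power_kw) / avg


def classify_workload(power_kw: Sequence[float], threshold: float = 0.15) -> str:
    """Label a trace 'steady' or 'bursty' using an assumed threshold."""
    return "steady" if power_variability(power_kw) <= threshold else "bursty"


print(classify_workload([95.0, 100.0, 105.0, 100.0]))  # training-like trace
print(classify_workload([10.0, 100.0, 15.0, 90.0]))    # inference-like trace
```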

LIQUID COOLING IMPLEMENTATION

Common Mistakes

Implementing liquid cooling in AI data centers is a high-stakes engineering project. Avoiding these common pitfalls is critical for achieving the promised Power Usage Effectiveness (PUE), reliability, and return on investment.

PUE (Power Usage Effectiveness) measures total facility energy divided by IT energy. A common mistake is treating liquid cooling as a silver bullet without optimizing the supporting infrastructure. If you install a direct-to-chip system but keep the room's computer room air handlers (CRAHs) at full blast, you're wasting energy.

The fix is integrated control:

Implement a Building Management System (BMS) that dynamically adjusts CRAH/CRAC fan speeds based on the heat captured by the liquid loop.
Use containment (hot or cold aisle) to prevent mixing and allow for higher room temperatures.
Validate energy use at the rack and row level, not just at the facility meter, so you can see which zones drive cooling overhead and find inefficiencies. For a broader architectural view, see our guide on How to Design a Sustainable Cloud Architecture for AI Workloads.
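The integrated control idea above can be sketched as a proportional rule: CRAH fans only need to handle the heat that the liquid loop does not capture. This is a simplified illustration with hypothetical parameters; production BMS logic typically uses PID loops and vendor-specific set points.

```python
def crah_fan_speed(
    it_load_kw: float,
    liquid_captured_kw: float,
    crah_design_kw: float,
    min_speed: float = 0.2,  # assumed minimum airflow fraction
) -> float:
    """Fan speed fraction (min_speed to 1.0) for the residual air-side heat."""
    if crah_design_kw <= 0:
        raise ValueError("crah_design_kw must be positive")
    residual_kw = max(0.0, it_load_kw - liquid_captured_kw)
    speed = residual_kw / crah_design_kw
    return min(1.0, max(min_speed, speed))


# 100 kW of IT load with 80 kW captured by the liquid loop.
print(f"Fan speed: {crah_fan_speed(100.0, 80.0, 50.0):.0%}")
```

Fan power falls roughly with the cube of fan speed, so even modest speed reductions can remove a large share of air-side cooling energy.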

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
