Power and cooling resilience in the age of AI: Strengthening Data Center uptime
Carsten Ludwig / R&M
Uptime is a key topic for mid-size and large data centers. There’s a marked desire to push availability as far past 99.99% as possible. However, if even one of the operational technologies involved is improperly planned or executed, data centers will fall well short of this mark.
Due to the increasing demands on power and cooling from artificial intelligence workloads, data centers are grappling with unprecedented challenges. This shift is not subtle; it is transformative. Ensuring uptime resilience in this environment requires more than redundant systems; it calls for an integrated, forward-thinking approach that considers every layer of infrastructure, from rack layout to facility-wide thermal management. What’s more, this demands a high degree of flexibility in allocating required resources such as connectivity, power, and cooling, as well as extremely fast movement of applications between racks in the server room.
Power and cooling failures continue to be significant risks to data center operations. That means resilience must be built in from the start. Planning begins with a holistic view of the entire site, considering not just construction and room layout, but also network architecture, power distribution, cooling systems, environmental monitoring, and ongoing maintenance. A cohesive Integrated Infrastructure Solutions strategy is fundamental to developing and maintaining a successful solution while also keeping an eye on Total Cost of Ownership.
A closer look at power
Power systems, in particular, face growing challenges. AI-ready servers and processors consume more energy than ever. Where server CPUs once drew under 100 watts a decade ago, today's models routinely exceed 200 watts, and many high-performance servers average 500 watts or more. Specialized chips with hundreds or even thousands of cores, such as Graphics Processing Units (GPUs), AI accelerators, and massively parallel processors, further elevate power consumption. A single GPU can consume over 300 watts, and when clustered for AI workloads, these systems can demand five times the power of traditional server environments—often within the same physical footprint.
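A rough back-of-envelope calculation illustrates how quickly these figures add up at rack level. All values in the sketch below are assumptions chosen for illustration, not measurements or vendor specifications:

    # Back-of-envelope rack power estimate for an AI cluster (Python).
    # All figures are illustrative assumptions, not vendor specifications.

    GPU_POWER_W = 400          # assumed per-GPU draw under load
    GPUS_PER_SERVER = 8        # assumed dense AI server configuration
    SERVER_OVERHEAD_W = 1_600  # assumed CPUs, memory, NICs, and fans per server
    SERVERS_PER_RACK = 5       # assumed rack population

    server_power_w = GPU_POWER_W * GPUS_PER_SERVER + SERVER_OVERHEAD_W
    rack_power_kw = server_power_w * SERVERS_PER_RACK / 1_000

    print(f"Per-server draw: {server_power_w / 1_000:.1f} kW")   # 4.8 kW
    print(f"Per-rack draw:   {rack_power_kw:.1f} kW")            # 24.0 kW
    # Roughly five times a traditional rack in the 5 kW range,
    # in line with the comparison above.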
Power infrastructure must be capable of handling not just sustained high loads, but also rapid fluctuations. AI systems are particularly sensitive to power quality; sudden voltage spikes or phase shifts can interrupt critical AI training runs. These risks underscore the importance of intelligent, high-quality power management. From the uninterruptible power supply (UPS) to the power distribution units (PDUs), every component must be engineered to respond quickly and accurately to shifting loads. Monitoring must occur in real time, and insight must reach a granular level. Visibility into the health of components such as fans and drives helps identify early signs of wear or failure risk. The same level of performance must also be delivered when adapted power delivery solutions are introduced, for example power transmission approaches such as busbar systems.
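To make the idea of granular, real-time monitoring concrete, the sketch below flags voltage excursions in a set of PDU readings. It is a minimal illustration only; the outlet names, reading values, and thresholds are hypothetical placeholders, and a production system would pull live data from the PDUs or a DCIM platform.

    # Minimal sketch of a real-time power-quality check.
    # The readings, outlet names, and thresholds are hypothetical placeholders.

    NOMINAL_VOLTAGE_V = 230.0
    MAX_DEVIATION_PCT = 5.0   # assumed tolerance before an alert is raised

    def check_power_quality(readings):
        """Return alerts for outlets whose voltage falls outside the tolerance band."""
        alerts = []
        for outlet, voltage in readings.items():
            deviation_pct = abs(voltage - NOMINAL_VOLTAGE_V) / NOMINAL_VOLTAGE_V * 100
            if deviation_pct > MAX_DEVIATION_PCT:
                alerts.append(f"{outlet}: {voltage:.1f} V ({deviation_pct:.1f}% off nominal)")
        return alerts

    # Example with simulated readings:
    sample = {"rack01-outlet03": 228.7, "rack01-outlet04": 243.9}
    for alert in check_power_quality(sample):
        print("ALERT:", alert)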
Resilient power also depends on smart design and operational discipline. Simple oversights, such as improper maintenance, human errors, insufficiently dimensioned or poorly managed and labeled racks, overlooked diagnostics, or degraded UPS components, can lead to downtime. These failures rarely make headlines, but they are common. Avoiding them requires meticulous layout planning, active power tracking, and strong operational protocols. UPS systems must be synchronized and regularly tested to eliminate single points of failure. Power continuity is only as strong as its weakest link, and that link is often human error. Comprehensive staff training and scenario-based testing are essential to reliability. The ability to anticipate failures, rather than simply react, sets resilient data centers apart.
A closer look at cooling
Significant heat generation pushes legacy air-cooling systems to their limits, if not beyond. The equation is simple: more energy input means more heat output. Data centers, in essence, are governed by thermodynamics.
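A simple worked example shows why. Using the heat-balance relation Q = m_dot * cp * delta_T for air, the sketch below estimates the airflow needed to carry away the heat of one dense rack; the rack load and temperature rise are assumptions chosen for illustration:

    # Illustrative estimate of the airflow needed to remove a given heat load,
    # using Q = m_dot * cp * delta_T. The rack load and temperature rise are
    # assumptions for the sketch.

    IT_LOAD_KW = 30.0      # assumed heat load of one dense rack (roughly its power draw)
    CP_AIR = 1.006         # specific heat of air, kJ/(kg*K)
    AIR_DENSITY = 1.2      # kg/m^3 at typical room conditions
    DELTA_T = 12.0         # assumed inlet-to-outlet temperature rise, K

    mass_flow_kg_s = IT_LOAD_KW / (CP_AIR * DELTA_T)
    volume_flow_m3_h = mass_flow_kg_s / AIR_DENSITY * 3600

    print(f"Required airflow: {volume_flow_m3_h:,.0f} m^3/h")
    # Roughly 7,500 m^3/h for a single 30 kW rack, a volume that quickly
    # becomes impractical with air alone.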
In response, many operators are turning to advanced cooling solutions, particularly liquid cooling. Unlike air cooling, which struggles to manage extreme heat in compact spaces, liquid cooling provides significantly greater efficiency.
Direct-to-Chip Cooling involves circulating liquid through pipes attached directly to chips and servers:
- Single-phase cooling transfers heat without changing the liquid's state.
- Two-phase cooling allows the liquid to evaporate, absorbing more heat.
Though highly effective, these methods still require supplemental air cooling to eliminate residual heat.
As Ascent engineering and construction VP Brad Pauley recently pointed out in a Data Center Dynamics article, “AI workloads dramatically increase power density in data centers, typically exceeding 120-136kW per rack – several times that of traditional servers – and often require a hybrid cooling approach, with 22-25 percent air cooling and 75-78 percent liquid cooling.”
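Applying the figures from that quote to a single rack gives a feel for what the split means in practice. The rack load and liquid share below are simply taken from the quoted ranges:

    # Simple arithmetic applying the quoted hybrid-cooling split to one rack.
    RACK_LOAD_KW = 130.0    # within the quoted 120-136 kW range
    LIQUID_SHARE = 0.76     # within the quoted 75-78 percent range

    liquid_kw = RACK_LOAD_KW * LIQUID_SHARE
    air_kw = RACK_LOAD_KW - liquid_kw

    print(f"Liquid-cooled: {liquid_kw:.0f} kW, air-cooled: {air_kw:.0f} kW")
    # About 99 kW is removed by the liquid loop, while roughly 31 kW still
    # has to be handled by the room's air-cooling capacity.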
What applies to power also applies to cooling: applications move between racks rapidly, so the capacity to supply higher levels of power and cooling must be in place upfront. In terms of cooling, operators can plan for a certain amount of redundant capacity to accelerate cooling when requirements peak at very short notice.
More immersive solutions are also gaining traction. Immersion cooling submerges IT hardware in non-conductive fluids, allowing for unparalleled heat transfer. However, this approach necessitates substantial changes in hardware orientation and physical layout, as well as closer collaboration between IT and facilities teams.
This also requires entirely new maintenance concepts; after all, moving a PCB that is dripping with liquid and weighs more than 50 kg is not easy. Besides that, the floor plans associated with horizontal (instead of vertical) deployments demand more space and require new floor support systems. In many cases, the traditional floor plan may no longer be possible.
Hybrid approaches are another viable option, adapting cooling methods to different thermal zones within the data center. Rear-door heat exchangers and in-row cooling units help manage hotspots and support diverse workload requirements.
Implementing liquid cooling requires more than just technical know-how. It demands specific infrastructure, including manifolds, pumps, heat exchangers, and leak detection systems. These components must be carefully integrated with the overall cooling and monitoring framework, particularly with data center infrastructure management (DCIM) platforms. Dedicated rack space and specialized planning are necessary to support cooling loops and remove heat efficiently. The investment is significant, but so are the benefits.
Liquid cooling is emerging as the most energy-efficient and sustainable solution for high-performance computing. It not only reduces energy consumption by up to 90 percent compared to traditional air cooling but also contributes to better power usage effectiveness (PUE) and lower operational costs. For data centers committed to sustainability, it offers a clear path forward, provided the facility has the expertise and infrastructure to support it. Liquid cooling requires specialized thermal management, expertise, and skilled staff.
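Power usage effectiveness is simply total facility energy divided by IT energy, so the effect of a more efficient cooling plant can be sketched quickly. The cooling and overhead figures below are illustrative assumptions, not measured results:

    # Illustrative PUE comparison: PUE = total facility energy / IT energy.
    # The cooling and overhead figures are assumptions for the sketch.

    IT_LOAD_KW = 1_000.0

    def pue(it_kw, cooling_kw, other_overhead_kw):
        return (it_kw + cooling_kw + other_overhead_kw) / it_kw

    air_cooled_pue    = pue(IT_LOAD_KW, cooling_kw=400.0, other_overhead_kw=100.0)
    liquid_cooled_pue = pue(IT_LOAD_KW, cooling_kw=120.0, other_overhead_kw=100.0)

    print(f"Air-cooled PUE:    {air_cooled_pue:.2f}")    # 1.50
    print(f"Liquid-cooled PUE: {liquid_cooled_pue:.2f}") # 1.22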
Striking the right balance
Ultimately, resilience in the data center is about readiness. The cost of downtime, particularly for AI workloads, is too high to accept unnecessary risk. However, overdesigning a facility can inflate costs and create inefficiencies.
Striking the right balance is essential, and this is where Integrated Infrastructure Solutions provides real value. By uniting power, cooling, monitoring, and maintenance under a single strategic vision, operators can scale with confidence, knowing their systems are prepared not only for today’s demands but also for what’s next.
Carsten Ludwig is the market manager for data centers at R&M.