In our previous blog post, we demonstrated how the Carbice Pad maintains optimal temperatures for high-performance GPUs, even after years of operating at peak power. This update shares results from ongoing thermal tests, including new thermal shock tests, demonstrating how the Carbice Pad's reliability can support efficient power usage and sustainability in high-performance computing (HPC) data center applications. In these applications, power and cooling can go out, causing periods of thermal shock that the hardware must survive without loss of performance.
The Reliability and Sustainability of the Carbice Pad
The demand for computing power is increasing at an astonishing rate year after year. At the same time, the proliferation of faster and hotter chips continues to test the reliability of the thermal interface materials (TIMs) designed to keep them from overheating. TIMs play a vital role in transferring heat from power-intensive electronic components, such as advanced GPUs, to the heat sinks that dissipate it. This in turn protects components from damage due to overheating.
Figure A: A basic representation of the air gaps between a heat source and a heat sink.
Figure B: This shows how a thermal interface material fills the air gaps between two non-uniform surfaces.
New high-density data centers equipped with advanced AI GPUs, like Nvidia’s H100, are pushing the limits of hardware, power, and thermal management even further. Some of the most powerful consumer gaming GPUs currently have thermal design power (TDP) ratings over 400W, while the Nvidia H100 GPU has a TDP of 700W. Many data center operators are already preparing advanced cooling solutions to handle the heat produced by chips in development with TDPs upwards of 1000W.
As chips become more expensive and advanced, it is increasingly important to maximize their performance over the longest possible timeframe to ensure a return on investment. Furthermore, in data center environments the performance of one GPU affects the speed and performance of all the others, so a single TIM failure that drives temperatures up can slow down the entire data center. With that in mind, let's dive into our updated testing on the Nvidia GeForce RTX 4080 Super GPU.
Our thermal cycle testing has now surpassed 6,000 cycles, equivalent to 3 years of use, exceeding the legacy industry standard of around 5,000 cycles for reliability testing. (We believe the standards need to be updated for the new demands of data centers.) At this stage, the Nvidia GPU continues to maintain a stable temperature of 69°C under maximum power usage. These results highlight the exceptional reliability of Carbice Pads, even after years of demanding use.
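As a rough sanity check on the cycle-to-years equivalence, the arithmetic can be sketched as follows. The function name and the 365-day year are illustrative assumptions; the only figures taken from our testing are the 6,000 cycles, the 3-year equivalence, and the roughly 5,000-cycle legacy standard.

```python
# Illustrative arithmetic only. The cycles-per-day rate is derived from the
# figures quoted above; it is not a published test parameter.

def implied_cycles_per_day(total_cycles: int, equivalent_years: float) -> float:
    """Return the thermal cycles per day implied by a cycles/years equivalence."""
    days = equivalent_years * 365  # assumption: 365-day year
    return total_cycles / days


rate = implied_cycles_per_day(6000, 3)
print(f"Implied usage: {rate:.1f} thermal cycles per day")

# At the same assumed rate, the legacy ~5,000-cycle mark corresponds to:
legacy_years = 5000 / (rate * 365)
print(f"5,000 cycles is roughly {legacy_years:.1f} years of use")
```

Under these assumptions, the equivalence works out to between five and six thermal cycles per day, and the legacy 5,000-cycle mark corresponds to about two and a half years of use.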
HPC Has a Sustainability Challenge
The escalating demand for advanced computing power, driven by emerging technologies like artificial intelligence, big data analytics, and the Internet of Things (IoT), is outpacing supply and presenting significant sustainability challenges. This surge in demand is not only straining existing infrastructure but also carries several downsides, including:
- Unsustainable Power Consumption
- Stress on the Power Grid
- Inefficient Resource Utilization
- E-waste Generation
- Degraded and Damaged Equipment
A recent Meta paper, The Llama 3 Herd of Models, analyzed a 54-day period of training its large language model (LLM), Llama 3 405B. The training used 16,000 GPUs, each with a 700W TDP. Meta reports that of the 419 unexpected interruptions during that period, 78% were due to hardware-related issues. GPUs, not surprisingly, accounted for 58% of all the unexpected interruptions. Meta also recorded daily impacts on GPU dynamic voltage and frequency scaling caused by higher midday temperatures.
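To put those percentages in absolute terms, here is a short back-of-the-envelope calculation. It uses only the figures quoted above; the variable names are ours, and the derived counts are approximate because the quoted percentages are rounded.

```python
# Figures quoted above from Meta's paper. Derived values are estimates.
TRAINING_DAYS = 54
UNEXPECTED_INTERRUPTIONS = 419
HARDWARE_SHARE = 0.78   # share attributed to hardware-related issues
GPU_SHARE = 0.58        # share attributed to GPUs
GPU_COUNT = 16_000
GPU_TDP_WATTS = 700

hardware_interruptions = round(UNEXPECTED_INTERRUPTIONS * HARDWARE_SHARE)
gpu_interruptions = round(UNEXPECTED_INTERRUPTIONS * GPU_SHARE)
interruptions_per_day = UNEXPECTED_INTERRUPTIONS / TRAINING_DAYS
hours_between_interruptions = 24 / interruptions_per_day

# Aggregate GPU power if every GPU ran at its full TDP (an upper bound,
# counting GPUs only and ignoring cooling and other overhead).
peak_gpu_power_mw = GPU_COUNT * GPU_TDP_WATTS / 1_000_000

print(f"Hardware-related interruptions: ~{hardware_interruptions}")
print(f"GPU-related interruptions:      ~{gpu_interruptions}")
print(f"Average: {interruptions_per_day:.1f} per day, "
      f"one every {hours_between_interruptions:.1f} hours")
print(f"GPU power at full TDP: {peak_gpu_power_mw:.1f} MW")
```

By this estimate, the cluster saw an unexpected interruption roughly every three hours while its GPUs alone were rated for about 11 MW.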
The growing demand for high-performance computing is presenting data centers worldwide with new and complex challenges. Retrofitting existing infrastructure to handle unprecedented heat output is leading to issues such as throttling, downtime, and damaged components.
To address this, we tested the Carbice Pad by simulating harsher environments, increasing ambient temperatures through thermal shock to see how it would perform.
The testing methodology involved running FurMark benchmarking software at 100% GPU power usage while conducting thermal shock testing in a controlled environmental chamber. The chamber's ambient temperature was adjusted to simulate extreme conditions, allowing us to observe the GPU's thermal response with the Carbice Pad under maximum load.
The thermal shock testing highlights the Carbice Pad's ability to maintain consistent thermal performance under extreme ambient temperature fluctuations from 30°C to 57°C. These extreme conditions accelerate the pump-out of thermal greases and phase change materials (PCMs), leading to rapid degradation and temperature rise, which is why the industry today recommends “re-pasting” after just 2 years. With Carbice, even at a 57°C ambient temperature, the GPU held steady at around 99°C, well below its 110°C maximum operating temperature. Most notably, the GPU temperature returns to a stable lower value once the thermal shock is removed, highlighting the Carbice Pad’s elasticity and its resilience to the large differential thermal expansion that such thermal shocks bring. This illustrates the Carbice Pad’s effectiveness in preventing thermal buildup, ensuring reliable heat transfer and stable GPU performance even under repeated harsh conditions.
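The margins implied by the thermal shock figures can be worked out directly. The sketch below uses only the temperatures reported above; the helper name is hypothetical, and the calculation describes the reported steady-state readings, not a model of the GPU's thermal behavior.

```python
# Temperatures reported for the thermal shock test (degrees Celsius).
AMBIENT_BASE_C = 30
AMBIENT_SHOCK_C = 57
GPU_TEMP_AT_SHOCK_C = 99
GPU_MAX_OPERATING_C = 110


def thermal_margin_c(gpu_temp_c: float, max_operating_c: float) -> float:
    """Return the headroom between a GPU temperature and its rated maximum."""
    return max_operating_c - gpu_temp_c


ambient_swing_c = AMBIENT_SHOCK_C - AMBIENT_BASE_C
rise_over_ambient_c = GPU_TEMP_AT_SHOCK_C - AMBIENT_SHOCK_C
headroom_c = thermal_margin_c(GPU_TEMP_AT_SHOCK_C, GPU_MAX_OPERATING_C)

print(f"Ambient swing during shock:     {ambient_swing_c} °C")
print(f"GPU rise over ambient at shock: {rise_over_ambient_c} °C")
print(f"Headroom below rated maximum:   {headroom_c} °C")
```

In other words, a 27°C ambient swing still left the GPU 11°C below its maximum operating temperature at full load.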
Better Performance Contributes to Efficient Power Usage
Power consumed by data centers already exceeds 1% of the total global power supply, a figure that continues to grow. To mitigate this, data centers are focusing on improving power usage effectiveness (PUE) to reduce their carbon footprint. A key factor in achieving better PUE is maintaining the long-term performance of electronic components, as degraded performance can lead to inefficiencies and increased energy consumption. As traditional thermal interface materials degrade over time, energy consumption could increase, whereas the Carbice Pad maintains its performance, contributing to long-term energy savings.
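For readers unfamiliar with the metric, PUE is conventionally defined as total facility energy divided by the energy delivered to IT equipment, so a value closer to 1.0 means less overhead for cooling and power distribution. The sketch below shows the calculation; the function name and the energy figures are hypothetical and chosen only to illustrate the ratio, not measurements from any data center or from our testing.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Return PUE: total facility energy divided by IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical example: the same IT load with two different overheads.
IT_LOAD_KWH = 1000.0
higher_overhead = pue(total_facility_kwh=1500.0, it_equipment_kwh=IT_LOAD_KWH)
lower_overhead = pue(total_facility_kwh=1400.0, it_equipment_kwh=IT_LOAD_KWH)

print(f"PUE with 500 kWh of overhead: {higher_overhead:.2f}")
print(f"PUE with 400 kWh of overhead: {lower_overhead:.2f}")
```

In this made-up example, trimming 100 kWh of cooling and distribution overhead for the same IT load moves PUE from 1.50 to 1.40.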
The Carbice Pad not only delivers exceptional thermal performance but also plays a vital role in sustainability by reducing long-term energy consumption. Extensive testing shows that the Carbice Pad maintains stable temperatures even after years of demanding use, ensuring GPUs and other high-power components run efficiently throughout their lifespan. By preventing thermal degradation, the Carbice Pad helps data centers avoid the energy inefficiencies associated with overheating, directly improving PUE. This reliability extends the lifespan of electronic components, ultimately reducing energy costs and minimizing environmental impact over the equipment's lifetime.
Just as copper is the standard metal for heat sinks, Carbice Pads are a new standard for data center thermal interfaces. Wide adoption will greatly simplify the significant challenge of system-level power and thermal management of data center assets.