Could data center GPUs see their lifespans shortened by running at consistently high utilization? The question was raised by an anonymous poster on X, and while the idea is intriguing, for now it remains speculative.
The post included a screenshot of an unverified comment attributed to a “GenAI principal architect at Alphabet Inc.” According to that comment, at utilization rates of around 60% to 70%, reportedly the norm for companies such as CoreWeave and Lambda Labs, a GPU’s lifespan could shrink to three years instead of the expected five. At lower utilization, the comment claims, operators could get the full five years of useful life out of a GPU.
The idea isn’t without merit. GPUs run very hot: the Hopper generation can draw more than 700 watts, and the upcoming Blackwell generation is expected to draw up to 1,000 watts. They throw off so much heat that air cooling becomes impractical, forcing operators to adopt water cooling.
There is also anecdotal evidence to support the view. Plenty of gamers have bought second-hand, high-end GPU cards that had been run hard for crypto mining, often around the clock for months or even years, only to have the cards fail on them; the buyers had no idea they were getting heavily used hardware.
Still, powering a PC on and off is arguably harder on it than leaving it running continuously, according to Jon Peddie, president of graphics research firm Jon Peddie Research (full disclosure: I provide some assistance to JPR), who holds a degree in electrical engineering.
“The factor that harms a chip is the process of powering it on and off; this results in temperature cycling which affects the connections,” he explained to me. “The only justification I can think of for a [data center add-in board] to fail is the excess heat generated by adjacent AIBs. I heat my modest lab with a single RTX 4090 – seriously.”
It seems Google may be feeling the heat from this (no pun intended), because it responded with a strong denial.
“Recent statements regarding Nvidia GPU hardware usage and lifespan attributed to an ‘unnamed source’ were incorrect, do not reflect how we use Nvidia’s technology, and are not representative of our experience. Nvidia GPUs are essential to our infrastructure, essential for both our internal operations and Cloud services, and our experience with Nvidia GPUs aligns with industry norms,” a Google representative stated in an email.
While there are plenty of gamers with stories of regret after buying a used mining card on eBay, there have been no reports from actual customers of data center GPUs failing prematurely. So treat this claim as speculation, not fact.