When it comes to training large language models, the usual complaint is the shortage of GPUs. Nvidia's dominant chips are, of course, what the various AI companies are competing to acquire.
But everyone's favorite billionaire and technology prophet, Elon Musk, sees another problem. Musk says the next-generation Grok 3 model from his startup xAI will need about 100,000 Nvidia H100 GPUs to train.
Admittedly, getting hold of 100,000 H100s will be neither easy nor cheap. But here is the real problem: each H100 draws up to 700 W at peak, so 100,000 of them add up to 70 megawatts at peak. All 100,000 will probably not run at 100% load at the same time, but an AI cluster is more than just GPUs; all kinds of supporting hardware and infrastructure are involved. In other words, a 100,000-H100 installation would draw over 100 megawatts, about the same as a small city. As another data point, in 2022 the entire Paris region had an estimated 500 megawatts' worth of data centers in operation.
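To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch. The 700 W peak figure, the 100,000 GPU count, and the 500-megawatt Paris figure come from the numbers above. The `overhead_factor` of 1.5 (cooling, networking, CPUs, power conversion) is purely an assumption, picked only because it lands in the "over 100 megawatts" range; the function names are made up for illustration.

```python
# Back-of-the-envelope power budget for a large GPU training cluster.
H100_PEAK_WATTS = 700        # peak draw per H100, as quoted above
GPU_COUNT = 100_000          # GPUs Musk says Grok 3 will need
PARIS_DATACENTER_MW = 500    # Paris-region data center figure for 2022


def gpu_peak_megawatts(gpu_count: int, watts_per_gpu: float = H100_PEAK_WATTS) -> float:
    """Peak power of the GPUs alone, in megawatts."""
    return gpu_count * watts_per_gpu / 1_000_000


def facility_megawatts(gpu_mw: float, overhead_factor: float) -> float:
    """Scale GPU power by an overhead factor for supporting infrastructure.

    overhead_factor is an ASSUMPTION, not a measured value.
    """
    return gpu_mw * overhead_factor


if __name__ == "__main__":
    gpu_mw = gpu_peak_megawatts(GPU_COUNT)
    total_mw = facility_megawatts(gpu_mw, overhead_factor=1.5)  # assumed overhead
    print(f"GPUs alone at peak: {gpu_mw:.0f} MW")
    print(f"With assumed overhead: {total_mw:.0f} MW")
    print(f"Share of the Paris-region figure: {total_mw / PARIS_DATACENTER_MW:.0%}")
```

With a different overhead assumption the total shifts accordingly; the 70-megawatt floor for the GPUs alone does not.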
So 100 megawatts for just one LLM is a bit of a problem. In an X Spaces interview with Nicolai Tangen, CEO of Norway's sovereign wealth fund (via Reuters), Musk said that GPU availability has been and will continue to be a major constraint on the development of AI models, but stressed that access to sufficient power is increasingly a limiting factor.
Musk also predicted that AGI (artificial general intelligence) will surpass human intelligence within two years. Musk said, "If you define AGI [artificial general intelligence] as smarter than the smartest human, I think that's probably within the next year or two."
Then again, he also predicted in 2017 that self-driving cars reliable enough that you could "fall asleep" in them were two years away. We are still waiting on that. And on March 19, 2020, he predicted that there would be "close to zero new cases" of Covid-19 in the US by the end of April. Oops!
Anyway, Musk's somewhat shaky track record with technical predictions is not exactly news. But he probably does have a pretty firm idea of how many GPUs it takes to train the next generation of LLMs. So city-sized power budgets are likely a reality, and a bit of a concern.
Furthermore, xAI's current model, Grok 2, apparently required only 20,000 H100s. That means the GPU count has grown by a factor of five from one model to the next, a rate of scaling that does not look very sustainable in terms of either GPU count or power consumption.
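For a rough sense of what that growth rate implies, the sketch below works out the generation-to-generation factor from the two GPU counts above, then extends it forward as a purely hypothetical what-if. Nothing here is a forecast: the assumptions that the same factor would hold, and that later generations would still use 700 W chips, are both made up for illustration, as are the function names.

```python
# What-if: extend the Grok 2 -> Grok 3 GPU growth factor forward.
GROK2_GPUS = 20_000          # H100s reportedly used for Grok 2
GROK3_GPUS = 100_000         # H100s Musk says Grok 3 will need
H100_PEAK_WATTS = 700        # peak draw per H100


def growth_factor(previous: int, current: int) -> float:
    """Ratio of GPU counts between two consecutive model generations."""
    return current / previous


def project_gpu_counts(start: int, factor: float, generations: int) -> list[float]:
    """GPU counts for `start` and the following generations.

    HYPOTHETICAL: assumes the same growth factor repeats every generation.
    """
    return [start * factor ** g for g in range(generations + 1)]


if __name__ == "__main__":
    factor = growth_factor(GROK2_GPUS, GROK3_GPUS)
    print(f"Growth factor: {factor:.0f}x")
    for step, gpus in enumerate(project_gpu_counts(GROK3_GPUS, factor, 2)):
        # GPU-only peak power, assuming 700 W per chip throughout
        megawatts = gpus * H100_PEAK_WATTS / 1_000_000
        print(f"Grok 3 + {step} generation(s): {gpus:,.0f} GPUs, {megawatts:,.0f} MW at peak")
```

Under those assumptions, two more generations at the same factor would put the GPUs alone well past the 500-megawatt Paris-region figure, which is the sense in which the trend looks unsustainable.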