xAI Launched Grok 3. Some Things About AI Data Centers.

in #hive-1679225 days ago

Elon Musk described Grok 3 recently as "scary smart", but he's known to hype things up when it suits him. So far, not many had access to the new model of xAI for intensive tests, but looks like the ones that did have been impressed, especially since xAI has been in the market for only 1 year. Here's the presentation, if you want to see it.

Looks like Grok 3 and the mini version outperform all competitors in the benchmarks presented by xAI in the presentation, and that while the full model is in beta and not even trained for as long as the mini version.

A more competent opinion is provided on X by Andrej Karpathy, someone I've seen with relevant AI tests and opinions in the past. He had access to Grok 3 and compared it to a number of other models, on a series of quick but relevant tests.

This is his summary at the end:

OpenAI recently launched o3 mini, and the full o3 will likely soon follow. If o3 is better than o1 (likely), than OpenAI will still have an edge compared to xAI, but the gap is slowly closing if they won't switch. The charts xAI presented show a remarkable growth rate for Grok, compared to their rivals.

The Race for AI Data Centers

But I want to talk a bit about their infrastructure, the 200,000 GPU cluster they have in Memphis, half of them being H100 and the other half H200, which is still the largest AI-focused data center cluster in the world.

The question immediately arises: but doesn't Google have bigger data centers? And the answer is: yes, it does, but not as a single cluster, with GPUs tightly connected. They have huge datacenters in the same location, but multiple buildings, so not as tightly connected.

So, why clusters? Aren't distributed data centers fine for AI? And the response is that they are perfectly fine for inference (i.e., in production, using compute to produce results), but not for training. For training, it's much better to have access to a big cluster, to cut the duration of the training as much as possible.

That's why all the tech giants in the US have plans to build this year clusters much bigger than xAI, somewhere in the range 400-600k GPUs.

If you think that is insane, well, xAI's next step is a cluster of 1m GPUs (5x the size of the current cluster in Memphis). In case anyone wanted to short nVidia, think twice.

Challenges for New Data Centers

(apart from high-end GPU supplies, I mean)

Cooling

Obviously, something like the data center in Memphis isn't air-cooled anymore. They are using liquid cooling, which seems to be the way to go, unless a more revolutionary method of cooling isn't invented in the meantime.

Power Requirements

This one is tough. If I don't mix things up, I think xAI's cluster in Memphis needs the power of an entire small town to run. Since bigger clusters are on the way of being built, they'll have bigger power needs.

The biggest bottleneck is the power grid, transportation of electricity. That's why the new clusters usually partner with businesses from the energy domain to have power generation built very close to the data centers.

People say small nuclear power plants would be needed, but they are built over a long time and the pace in AI is much higher than that. So, they'll have to figure out other solutions.

Power Fluctuation

The power fluctuation is extreme between the times when GPUs are at 100% and when they are idle, as the xAI team discovered at their cluster. The power generators apparently can't handle that. The solution they came up to was to give GPUs bogus tasks when they should be idle, so that they keep them working all the time, and keep the energy requirement relatively constant in time.

Much of what I included in the section about data centers comes from listening to this 5+ hours podcast of Lex Fridman, a little bit at a time for a week or so. Feel free to do the same, there's much more information there.

Posted Using INLEO

Sort:  

Hm, the interesting factoid about the "bogus tasks" was not validated by Grok, but I guess "he" is biased 😀.

image.png

Well... Elon Musk referred to it in the presentation, but didn't want to say how they fixed the problem. I know their solution from that podcast I linked at the end. The guy who shared it either knows an impressive amount of things about anything tech in the world, or he invents some of it. Somehow I doubt it's the latter because there were two guests and the other one could have objected if he started spreading lies. They did occasionally have disagreements on nuances of certain things they talked about, but they were mostly on the same page. By the way, Lex Fridman, the host, also visited the cluster in Memphis, so he probably knows quite a few details too...

I'm trying to understand most of the development on the X AI data center which you are analyzing. The analysis is comprehensive

Let me know if you have questions.

It's crazy to see how fast AI is progressing, but I don't know if I can trust the data in the presentation. I wouldn't be surprised if they manipulated it in a way to get those numbers, but it's probably quite close to the existing models. There is already a power issue, and honestly, I don't know how things will get resolved.

It wouldn't be the first time they manipulate a presentation. But we will know soon enough from independent testing.

Yep, power is a big bottleneck. Maybe bigger than the chips. It's interesting to see how everything unfolds.

Elon Musk' is a business man so I think hype is probably one of his strongest suits. I don't think GROK3 will be anything as good. However, let's see how it goes.

Well, only people paying them a lot (i.e. Premium+) are able to even see it, so... it'll be a while to have some "unofficial" hard tests.

What is most amazing is that xAI is a lot cheaper than Open AI and rank better on LM Areana

Yes, I saw Premium+ for Grok 3 is significantly cheaper than o1 pro. What's interesting on LM Arena is that Gemini's latest models are ranked 2nd and 3rd.

Gemini's latest models are ranked 2nd and 3rd.

They used to be on top when they released. LM Arena Leaderboard is based on blind testing. The results should be close to what the real rankings should be based on the needs of the prompters.

LM Arena Leaderboard is based on blind testing.

Ah, they talked about this in the presentation. I didn't make the connection.