Race for High-Quality Data Continues


Some time ago I watched the presentation of the GPT-4o model from OpenAI. If you haven't seen it, it might be interesting to you.

Earlier today, I was curious to see the progression over time of the GPT versions released by OpenAI. Who better to ask than ChatGPT in this case? Here's what I got:

| Version | Release Date |
|---------|--------------|
| GPT-1 | June 2018 |
| GPT-2 | November 2019 (partially released in February 2019) |
| GPT-3 | June 2020 |
| GPT-3.5 | November 2022 |
| GPT-4 | March 2023 |
| GPT-4o | May 2024 |
| GPT-5 | ? |

We can see OpenAI had major releases every year or so until they hit GPT-3, which made them stand out. Then they took it slow with GPT-3.5, perhaps scared of their own success (or not being ready for it). ChatGPT tells me they have offered API access since GPT-3; I thought they opened that up later, with GPT-3.5.

A few months later came GPT-4, which seems to me to have been worked on in parallel with GPT-3.5. Also note that GPT-4 came almost three years after GPT-3, while before that, major releases came roughly yearly.

The current version, GPT-4o, was released about a year after GPT-4.
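If you want to check the gaps yourself, here is a minimal Python sketch (using only the month-level dates from the table above) that prints the interval between consecutive releases:

```python
from datetime import date

# Release dates from the table above, at month precision
releases = {
    "GPT-1": date(2018, 6, 1),
    "GPT-2": date(2019, 11, 1),  # full model; partial release was February 2019
    "GPT-3": date(2020, 6, 1),
    "GPT-3.5": date(2022, 11, 1),
    "GPT-4": date(2023, 3, 1),
    "GPT-4o": date(2024, 5, 1),
}

versions = list(releases)
for prev, curr in zip(versions, versions[1:]):
    # Gap in whole months between consecutive releases
    months = (releases[curr].year - releases[prev].year) * 12 \
        + (releases[curr].month - releases[prev].month)
    print(f"{prev} -> {curr}: {months} months")
```

It prints gaps of 17, 7, 29, 4, and 14 months, which shows the slowdown after GPT-3 quite clearly.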

From a ranking (based on community votes, apparently) that I found on Grok's X account (xAI's chatbot), it looks like GPT-4o is still considered the number one LLM, with Google's Gemini the closest contender. Since xAI bragged about the ranking on their account, it's no surprise their own model is doing well too: it sits in third place.

Strong competition is coming for OpenAI and their GPT models. We also hear the next version, GPT-5, will be another game changer. The question is: with such intense competition, will OpenAI (or any of them) take the time to be cautious about where they lead this technology? I certainly doubt it. They are all pushing at full throttle, in my opinion.

OpenAI has the advantage of still having a slight edge over the competition, but the BIG disadvantage of not owning massive amounts of data to train their models. Google in particular, Meta, and to some degree X don't have this issue. I wonder if X added long-form content for this very reason...

There is definitely a race to acquire more quality data to train these models. At the rate these LLMs are developing, it is predicted they will run out of quality data by the end of the decade or the early 2030s. There are solutions after that, of course, but none as good as having access to massive amounts of quality data. By the way, "high-quality data" is the expression ChatGPT used when I asked it when LLMs will run out of training data.

I ran out of credits, or I would have asked it to elaborate on the internet as a training source:

> A key factor is the rate at which new, high-quality data is generated and curated. Currently, models are trained on vast datasets that include large portions of the internet, books, academic papers, and other sources. However, as LLMs grow larger, they require more data to achieve further improvements, and eventually, the available quality data may become a limiting factor.



So my question is, why rush it? I feel like a six-month interval is actually very short for releasing different versions. Why not build them to last while taking the time to develop the next models?

Because it matters who is first. Before ChatGPT had any competition, it was THE generative AI everyone talked about and USED. Every big player wants a market share in this disruptive domain, hence the race.

That's what competition does. The rush just takes away the need for quality...

The need remains. It just doesn't seem to be a priority when you have competition breathing down your neck or you are about to catch the one in front.

OpenAI is progressing so fast

I'm sure GPT-4o will soon be replaced too.

Thank you for sharing this.
Don't you feel that major releases every 6 months are quite fast?

I don't know about the others, who are at an earlier stage with their AI models, but OpenAI doesn't release major versions every 6 months anymore. Probably the more advanced the model, the harder it is to release new major versions in rapid succession.

I kind of expected the newer models to take longer. It takes more data and refinement, so I think it makes sense. In a way, that gives other platforms a chance to catch up too.

It makes sense, yes. Also, various blind spots of the previous versions become obvious as the model grows more popular, and better training is needed to fix the problems discovered.

This technology is developing very fast and benefiting people a lot, so we will see more and more users in the near future.

They will surely change the world as we knew it up until a few years ago.

As the competition increases within the AI space, the need for safety and quality will probably be the least of priorities for these companies as they roll out new products. I think Google and Meta have a strong advantage with the amount of data generated by their platforms so far.

I think Google and Meta have a strong advantage with the amount of data generated by their platforms so far.

I agree. Amazon as well; I hear they entered this race too.

safety and quality will probably be the least of priorities for these companies as they roll out new products

Unfortunately, that's the likely scenario we are in. But it's a dangerous path, given what they are developing.