Deepseek made headlines with its claims about its model and the cost to train it. We saw a massive shift along with volatility in the financial markets as many questioned the path Big Tech is taking.
Are they now obsolete? Not exactly.
That said, the Deepseek team did accomplish some genuine engineering feats. However, the full accuracy of the claims is being disputed.
One irony that came out of this was OpenAI pushing back, stating that Deepseek was the result of distilling ChatGPT. This is funny since OpenAI trained its early versions on nothing more than data scraped from different websites.
Talk about the pot calling the kettle black.
Today, the practice is diminishing a bit since many sites are locking down robots.txt, the file that tells crawlers which parts of a site they may access.
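For illustration, a robots.txt that blocks AI-training crawlers while leaving the rest of the site open might look like this (GPTBot is OpenAI's crawler and CCBot is Common Crawl's; which agents a site blocks is up to the site owner):

```
# Block AI-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow everything else
User-agent: *
Allow: /
```

Compliance is voluntary on the crawler's part, which is part of why the battle over data continues.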
As always, this centers, in large part, on the quest for data. It is an ongoing battle.
Hive Can Supplement The Distillation Process
Hive and other permissionless databases can help in this endeavor.
There are basically two choices: have data in the hands of Big Tech, i.e. those with the major platforms, or distribute the data onto networks that allow open access.
Hive is one such network. Anyone is free to set up an API and interact with the databases.
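As a sketch of how open that access is, here is the shape of a JSON-RPC request any client can send to a public Hive node (`condenser_api.get_content` is the classic API method; the author and permlink values here are placeholders):

```python
import json

def hive_rpc_payload(method, params):
    """Build a Hive-style JSON-RPC 2.0 request body."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": 1,
    })

# Placeholder author/permlink, for illustration only
payload = hive_rpc_payload("condenser_api.get_content", ["author", "permlink"])
```

POST the payload to a public node such as https://api.hive.blog with any HTTP client; no API key or signup is required for reads, which is the point of permissionless access.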
Before getting into that, let us look at what Deepseek did with regard to distillation.
This came from VeniceAI:
Distillation in AI, often referred to as knowledge distillation, is a technique that transfers knowledge from a large, complex model (the teacher) to a smaller, more efficient model (the student). This process allows the smaller model to maintain similar performance while being easier to deploy and requiring less computational power.
Basically, ChatGPT was queried about a range of topics, and the output was accumulated by the team. The process was likely repeated millions of times, pulling in a great deal of data.
This helped to solve the problem of data. The output was also probably structured in a reasoning format, since the larger models are all shifting toward that.
The engineers behind Deepseek were able to create their own algorithms and train their own weights on that data. Other sources could have been drawn upon and integrated. The result follows a reasoning format similar to what OpenAI built, with the student mimicking the teacher.
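The teacher-student idea can be sketched with the classic soft-target loss from knowledge distillation. This is a generic illustration, not Deepseek's actual training code; the temperature value and toy logits are arbitrary:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened probability distribution over logits."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from student to teacher on softened outputs.

    The student is trained to minimize this, i.e. to mimic the
    teacher's full output distribution, not just its top answer.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q)))) * temperature ** 2
```

When the student's logits match the teacher's, the loss is zero; the further the student's distribution drifts, the larger the penalty.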
Proprietary Data
We are still in a world of proprietary data. Even though Deepseek is open source, the data used is not being shared. This is the case for other open models such as Llama. The weights are there for all to see, but the data is not leaving Meta's servers.
The results of the prompts issued by Deepseek are available only to that company and OpenAI. Nobody else has access.
Hence, one of the reasons to use distillation is to get data rapidly, even if synthetic. Reasoning models are showing that they can excel in particular areas on this type of data. It was something that was hotly debated until recently. The reasoning models are getting high marks using synthetic data in cases where there is a correct answer. This means topics such as math and physical sciences are seeing results.
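The "correct answer" property is what makes this kind of synthetic data usable: it can be graded automatically. A toy illustration of my own (not any lab's actual pipeline): generate arithmetic problems and verify answers exactly.

```python
import random

def make_problem(rng):
    """Generate a synthetic arithmetic problem with a known answer."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    return f"{a} + {b} = ?", a + b

def verify(model_answer, truth):
    """Exact grading is possible because the answer is checkable."""
    return model_answer == truth

rng = random.Random(0)            # seeded for reproducibility
question, truth = make_problem(rng)
```

Math and the physical sciences allow this kind of exact check; "artsy" domains do not, which is why they still need the human touch.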
Where they can fall short is in the more "artsy" areas. Here is where the human touch is still required. Therefore, the need for more than just synthetic data is crucial.
For this reason, I think that social media platforms feeding into these models have an advantage. Under this scenario, we are seeing a combination of human and synthetic. In fact, the humans are often engaging with the data.
The VeniceAI Model
VeniceAI made news last week with the release of their token. This is the first token tied to a platform that is exclusively a chatbot. Perhaps the timing was not the best, given that markets are being crushed by the tariffs implemented by the Trump administration.
That said, we are seeing the idea of AI and crypto starting to play out. VeniceAI uses its token for access. Holders can stake it, which equates to VCU (Venice Computer Units), allowing for interaction with the chatbot. It is similar to the resource credit system on Hive.
Where VeniceAI takes a different approach is in the fact that they are privacy driven. According to the team, they do not store the data from the prompts; it is designed so that the data is stored locally in the browser. We do have to take their word for it, since there is no proof this is the case.
Presuming they are being genuine, this adds a different spin. The privacy feature is valuable, but it also counters the idea of generating more data. In other words, that prompt data is lost. It is understandable that not all prompts should be public. However, if we think about what most people discuss, it is not exactly high security secrets.
The spectrum of AI and data is huge. There is room for plenty of services to appear. While some might focus upon privacy, others can concentrate on public data. This is where permissionless, decentralized databases can assist.
The Democratization of Data
We are seeing a push by the AI participants to get "as many eggs in their baskets" as possible. It is very much a "mine" mentality.
To me, this simply favors the major players. We are not going to keep pace with Big Tech by feeding them more. Sure, OpenAI will complain when distillation is done using their model, but that changes nothing. Even Deepseek is still 7 or 8 months behind, according to many estimates.
The compute problem is tough to overcome. This is compounded by the data issue. Basically, the models improve as more compute (and data) is put forth. A 10x increase in compute has a major impact, albeit not at a 1:1 ratio; under this circumstance, perhaps a 3x improvement in the model is realized.
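As rough arithmetic, treating the 10x-compute-to-3x-model figure as an assumed power law (an illustration of the diminishing-returns point, not a measured scaling law):

```python
import math

# Assumption: if 10x compute yields roughly 3x model improvement,
# the implied power-law exponent alpha satisfies 10 ** alpha = 3.
alpha = math.log(3) / math.log(10)   # about 0.477

def implied_gain(compute_multiple, exponent=alpha):
    """Model-quality multiple under the assumed power law."""
    return compute_multiple ** exponent

print(round(alpha, 3))               # 0.477
print(round(implied_gain(10), 2))    # 3.0
print(round(implied_gain(100), 1))   # 9.0
```

Under this assumption, even 100x the compute buys only about 9x the model, which is why the spending race compounds rather than resolves.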
It is a back and forth race.
We have the battle for more of both compute and data. Then we have innovation, with each level of the stack drawing attention. Algorithms are improving and capabilities expanding, making existing data, synthetic or otherwise, more valuable.
And then we have the expected move into embedded AI, where large volumes of data will be acquired through the sensors on cars, robots, and other devices that move through our environment.
Here again, we have something that will likely benefit Big Tech since it takes a lot to get into the robotics game.
The winner-take-most proposition could be spreading to the real world. That is something we must fight in every way possible.
Posted Using INLEO