DeepSeek seems to have taken the world by storm, including AI specialists. I kind of knew something like this was coming, because I had heard from the CEO of a small Chinese startup (formerly at Google, if I remember correctly) that they were working on optimizing the costs of AI models, something he was surprised none of the American corporations were doing. That is a different company from DeepSeek, by the way; I checked who DeepSeek's CEO is.
So what has DeepSeek accomplished? Many things, to be honest.
(Image: Ideogram's magic prompt did most of the work here...)
It is said they trained their top AI model with only 6 million US dollars and a limited number of chips compared to their Silicon Valley AI giant counterparts. Both the amount of money and the number of chips seem to be disputed, or perhaps the dispute is an attempt to allow the big players in the AI race to save face, for the time being.
Their top model, DeepSeek R1, rivals OpenAI's o1, until now the top LLM, and even beats it on certain benchmarks.
On the technical side, DeepSeek also doesn't rate limit its API. That may cause delays in API responses, but it brings a different approach to the market of APIs for AI models. Their server does close the connection if the request isn't given a proper answer within 30 minutes. All the technical and not-so-technical details can be found in DeepSeek R1's GitHub repository.
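As an aside, DeepSeek's API is, as far as I know, OpenAI-compatible, so dealing with that behavior on the client side mostly comes down to setting a generous timeout instead of juggling rate limits. Below is a hedged sketch using the openai Python package; the base URL and model name are taken from DeepSeek's documentation as I recall it, so treat them as assumptions rather than gospel.

```python
# Hedged sketch: calling DeepSeek's OpenAI-compatible API with a long
# client-side timeout, since responses may be slow under load and the
# server closes connections that run past roughly 30 minutes.
# The base_url and model name below are assumptions based on DeepSeek's docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",         # placeholder
    base_url="https://api.deepseek.com",     # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-reasoner",               # assumed name for the R1 model
    messages=[{"role": "user", "content": "Explain MoE routing in one paragraph."}],
    timeout=1800,                            # allow up to 30 minutes client-side
)
print(response.choices[0].message.content)
```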
They open sourced all their models (including the parameters), and the license they use permits commercial use and distillation of their models (we'll get to what that means below).
On the other side of the world, Meta also open sourced Llama's code - which seemed like a bold move in the corporate AI race - but its license restricts certain commercial uses, and the training data is not disclosed.
It's interesting that DeepSeek has a few distilled models based on Llama 3 and a few others based on Qwen, the AI models built by Alibaba Cloud.
I've mentioned distilled AI models a few times already. Let's see what that means.
What Is LLM Distillation?
LLM distillation is a technique of transferring knowledge from a larger pre-trained model (the "teacher") to a smaller model (the "student"). (Source)
The advantage is that you can train a large model on huge datasets, then extract specific knowledge from it and have the smaller model improve its predictions by mimicking the larger one. An example probably everyone knows is OpenAI's o1-mini, widely considered a distillation of the larger o1 model.
With open source LLMs, though, scientists and other categories of users can train smaller models tailored to their needs, starting from the top model.
Where DeepSeek goes further than most competitors is that it allows anyone to distill its models, with no restrictions.
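To make the idea concrete, here is a minimal sketch of how distillation training is commonly done, written in PyTorch. The teacher, student, and batch layout are hypothetical placeholders, not DeepSeek's actual recipe: the student is trained on a blend of the usual hard-label loss and a soft loss that pushes its output distribution toward the teacher's.

```python
# Minimal knowledge-distillation sketch (PyTorch). The teacher/student models
# and batch layout are placeholders, not any vendor's actual training code.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target loss (mimic the teacher) with a hard-label loss."""
    # Soften both distributions with the temperature, then match them with KL.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Standard cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

def train_step(student, teacher, batch, optimizer):
    tokens, labels = batch                  # hypothetical batch layout
    with torch.no_grad():                   # the teacher is frozen
        teacher_logits = teacher(tokens)
    student_logits = student(tokens)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```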
What is the Mixture of Experts (MoE) Architecture for LLMs?
Newer LLMs use what is called a Mixture of Experts (MoE) architecture. That means the full neural network is segmented into different subnetworks with various specializations (called experts).
The MoE architecture consists of two components: a gating network and the experts themselves.
How does this architecture work?
A prompt from the user is processed by the gating network, which decides in real time which expert the request is for and routes it there. As a result, the subnetwork associated with the selected expert becomes active while the rest stay inactive, reducing compute needs.
Also, experts can be trained individually on top of the whole network, which should make them more accurate for their domain of expertise.
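Here is a minimal sketch of an MoE layer with top-1 routing, matching the description above. It's a toy illustration in PyTorch, not how DeepSeek implements it; real systems add load balancing, more elaborate routing, and much larger experts.

```python
# Toy Mixture-of-Experts layer (PyTorch) with top-1 routing: the gating
# network scores the experts and each token is sent to its best-scoring one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim, num_experts=4, hidden=256):
        super().__init__()
        # Gating network: a small linear layer scoring each expert per token.
        self.gate = nn.Linear(dim, num_experts)
        # Experts: independent feed-forward subnetworks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                            # x: [tokens, dim]
        scores = F.softmax(self.gate(x), dim=-1)     # gating probabilities
        best = scores.argmax(dim=-1)                 # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = best == i                         # tokens routed to expert i
            if mask.any():
                # Only the selected expert runs for these tokens; the other
                # subnetworks stay inactive for them. The output is weighted
                # by the gate's confidence in that expert.
                out[mask] = scores[mask, i].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route a batch of 8 token vectors through the layer.
layer = MoELayer(dim=64)
print(layer(torch.randn(8, 64)).shape)               # torch.Size([8, 64])
```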
My question here would be: what happens if a request needs multi- or interdisciplinary expertise? Is only one expert still chosen, or are all of the needed ones activated?
DeepSeek V3 and DeepSeek R1 both have an MoE architecture. None of the OpenAI or Claude models they were compared against are publicly confirmed to use such an architecture at this time.
Final Considerations
If anything, DeepSeek showed that the dominant AI giants are not so dominant, and opening things up almost completely is a remarkable strategy: an attempt to use decentralization as a way to fight the existing giants. This is what people should think about, especially those who like gated platforms, wherever they may be.
Of course, it's been quite a while since the last OpenAI model was released. I don't remember when the last Claude model came out either. But OpenAI should be relatively close to releasing a new model. That doesn't diminish in any way what DeepSeek has accomplished, and it doesn't really matter where they are from. What matters is what they are doing and the route they've chosen, sure, probably forced by US export bans on top-of-the-line Nvidia chips to China.
Let's see who moves next...
Posted Using INLEO