DeepSeek-V3: Technical Details
Title: DeepSeek-V3 Technical Report
Paper: https://arxiv.org/abs/2412.19437
Repository: https://github.com/deepseek-ai/DeepSeek-V3
While my previous post about DeepSeek was more general, today I'd like to delve into some technical solutions in DeepSeek that we haven't discussed before.
First off, what's important to know about DeepSeek-V3 is that it's still a relatively classical transformer decoder, but with Mixture-of-Experts (MoE). DeepSeek-V3 contains 671B total parameters, of which 37B are active for each token. It has 61 transformer layers with a hidden dimension of 7168.
The paper includes several interesting solutions worth noting for historical context. Let's start with a couple of things verified in the previous version, DeepSeek-V2 (https://arxiv.org/abs/2405.04434).
Multi-head Latent Attention (MLA)
First, Multi-head Latent Attention (MLA). What is it?
In classical Multi-Head Attention (MHA), input token embeddings h_t are projected into query, key, and value vectors q_t, k_t, v_t through independent projection matrices Wq, Wk, Wv and then split into vectors for individual attention heads. After self-attention (the famous softmax(QK^T/sqrt(d))*V), we get o_t for individual heads, concatenate them, and generate the layer output through matrix Wo.
MLA performs low-rank compression for keys and values, where h_t is first projected into a low-rank latent vector c_t, and then this vector is expanded into k_t, v_t through separate matrices Wuk, Wuv. The latent vector size, d_c, is much smaller than the final dimension considering all heads (d_h*n_h). During inference, this reduces the required KV-cache size because only low-dimensional c_t needs to be cached, rather than full-dimensional k_t, v_t as before.
Moreover, during inference the up-projection matrices for keys and values don't need to be applied explicitly: the matrix for k_t (Wuk) can be absorbed into the query-side matrix (Wq), and the matrix for v_t (Wuv) into the output matrix Wo.
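A tiny numpy check of why this absorption works for the attention logits (toy sizes of my choosing, single head, RoPE ignored):

```python
import numpy as np

d_cq, d_ckv, d_head = 32, 16, 8              # toy latent and head sizes, not the real ones
W_uq = np.random.randn(d_head, d_cq)         # query latent -> per-head query
W_uk = np.random.randn(d_head, d_ckv)        # KV latent    -> per-head key
c_q, c_kv = np.random.randn(d_cq), np.random.randn(d_ckv)

naive = (W_uq @ c_q) @ (W_uk @ c_kv)         # materialize q and k, then take the dot product
A = W_uq.T @ W_uk                            # absorb W_uk into the query side, precomputed once
absorbed = c_q @ A @ c_kv                    # same logit, computed straight from the cached latent c_kv
print(np.allclose(naive, absorbed))          # True
```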
In fact, q_t also undergoes low-rank compression into its own c_t vector. While this doesn't affect the KV-cache, it helps reduce the memory footprint for activations during training.
There was an issue with RoPE positional embeddings being incompatible with low-rank KV compression. To solve this, they proposed a decoupled RoPE strategy: additional per-head query vectors qR and a single shared key vector kR, each with its own dimension dR_h. The final Q and K vectors are concatenations of the parts obtained from the corresponding low-rank vector c_t and the RoPE parts (qR, kR).
Look at the formulas in section 2.1.2; they're clearer than the text description.
In DeepSeek-V2, the latent vector dimension d_c was set to 4*d_h (the total dimension of four heads), and the RoPE dimension dR_h to d_h/2 (half a head). In DeepSeek-V3's MLA, there are 128 attention heads, each with dimension 128. The d_c dimension is 512, and dR_h stays at 64.
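To make the shapes concrete, here's a minimal PyTorch sketch of the MLA projections with decoupled RoPE. The dimensions follow the paper's config (the query latent dimension of 1536 also comes from there), but the class and variable names are mine, the actual RoPE rotation is omitted, and so is the matrix absorption described above.

```python
import torch
import torch.nn as nn

# Rough DeepSeek-V3-like dimensions; names below are mine, not the paper's notation.
D_MODEL, N_HEADS, D_HEAD = 7168, 128, 128
D_C_KV, D_C_Q, D_ROPE = 512, 1536, 64     # KV latent, query latent, decoupled RoPE dim

class NaiveMLA(nn.Module):
    """MLA projections only; the attention itself and the RoPE rotation are omitted."""
    def __init__(self):
        super().__init__()
        self.w_dq  = nn.Linear(D_MODEL, D_C_Q, bias=False)            # h_t -> c_q (query latent)
        self.w_uq  = nn.Linear(D_C_Q, N_HEADS * D_HEAD, bias=False)   # c_q -> q (content part)
        self.w_qr  = nn.Linear(D_C_Q, N_HEADS * D_ROPE, bias=False)   # c_q -> q_rope (per head)
        self.w_dkv = nn.Linear(D_MODEL, D_C_KV, bias=False)           # h_t -> c_kv (cached!)
        self.w_uk  = nn.Linear(D_C_KV, N_HEADS * D_HEAD, bias=False)  # c_kv -> k (content part)
        self.w_uv  = nn.Linear(D_C_KV, N_HEADS * D_HEAD, bias=False)  # c_kv -> v
        self.w_kr  = nn.Linear(D_MODEL, D_ROPE, bias=False)           # h_t -> k_rope (shared, cached!)

    def forward(self, h):                                  # h: [batch, seq, D_MODEL]
        b, s, _ = h.shape
        c_q, c_kv, k_rope = self.w_dq(h), self.w_dkv(h), self.w_kr(h)
        q = torch.cat([self.w_uq(c_q).view(b, s, N_HEADS, D_HEAD),
                       self.w_qr(c_q).view(b, s, N_HEADS, D_ROPE)], dim=-1)
        k = torch.cat([self.w_uk(c_kv).view(b, s, N_HEADS, D_HEAD),
                       k_rope.unsqueeze(2).expand(b, s, N_HEADS, D_ROPE)], dim=-1)
        v = self.w_uv(c_kv).view(b, s, N_HEADS, D_HEAD)
        # Per token, only c_kv (512 dims) and k_rope (64 dims) need to go into the KV cache.
        return q, k, v, (c_kv, k_rope)
```

Per head, Q and K end up with dimension 128+64=192 (content part plus RoPE part), while V stays at 128; the key point is the last comment: only the small latent and the shared RoPE key are cached.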
Remember that this isn't the only way to optimize attention for faster generation. Many have moved away from classical MHA to Multi-Query Attention (MQA) by Noam Shazeer (https://arxiv.org/abs/1911.02150), where K and V are shared across all attention heads (significantly speeding up inference with a slight quality degradation), and to Grouped-Query Attention (GQA), also from Google (https://arxiv.org/abs/2305.13245), a middle ground between MHA and MQA. In GQA, there is one key-value head per group of query heads, so the number of KV heads is greater than one but smaller than the number of query heads, and quality can approach that of the original MHA.
MLA saves cache space efficiently (its KV cache is comparable to GQA with 2.25 groups), while its performance even exceeds that of MHA.
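A quick back-of-the-envelope check of that "2.25 groups" figure, counting cached elements per token per layer (my arithmetic, not a quote from the paper):

```python
n_heads, d_head = 128, 128
d_c, d_rope = 512, 64

mha_cache = 2 * n_heads * d_head       # keys + values for all heads: 32768 elements
mla_cache = d_c + d_rope               # compressed KV latent + shared RoPE key: 576 elements
print(mha_cache, mla_cache, mla_cache / (2 * d_head))   # 32768 576 2.25 "GQA groups"
```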
It seems that MLA should now dominate everywhere. I'm not aware of anything better that's been published.
DeepSeekMoE
Second, DeepSeekMoE (https://arxiv.org/abs/2401.06066), also used in DeepSeek-V2.
The "experts" reside in FFN layers, not in MLA, and the layer is replaced by selecting and calling some number of "experts" from all available ones. Essentially, each expert is a separate FFN layer selected by some routing algorithm. The classic GShard (https://arxiv.org/abs/2006.16668) activated two experts per layer, while Switch Transformer (https://arxiv.org/abs/2101.03961) used one. Accordingly, each token is sent for processing to selected experts, and if there's more than one, their responses are mixed in some way (for example, with weights).
DeepSeekMoE aims to achieve greater specialization from experts. For this, experts are divided into smaller ones: each expert is split into m pieces, and m times more experts are activated, so the total computation remains roughly the same. This is called Fine-Grained Expert Segmentation. Instead of K active experts out of N, we get mK out of mN. This gives much richer combinatorics, with many more possible combinations of experts that can be involved, potentially leading to more interesting expert specialization.
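A quick sense of scale, using what I believe is the illustrative example from the DeepSeekMoE paper (16 experts with top-2 routing, each split into m=4 pieces):

```python
from math import comb

N, K, m = 16, 2, 4
print(comb(N, K))          # 120 possible expert combinations before segmentation
print(comb(m * N, m * K))  # 4,426,165,368 combinations after fine-grained segmentation
```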
On the other hand, some general knowledge might be required, and for this, it makes sense to allocate some shared experts that tokens are always sent to. This way, there's hope that general knowledge will be learned there rather than independently in many other experts. You could say that ultimately there are N_s shared experts and N_r routed experts.
DeepSeek-V3 uses 1 shared expert and 256 routed experts, from which 8 are selected as active.
Routed experts are selected as top-k based on an affinity score, calculated as the dot product of the token's hidden state (the FFN input) and a specific expert's centroid vector. The paper doesn't dwell on where this centroid comes from; in the implementation it's simply a learnable vector per expert (a row of the gating weight matrix), which during training ends up reflecting the kinds of inputs that expert handles.
In DeepSeek-V2, they took the softmax of this product: s_{i,t} = Softmax_i(u_t^T e_i), and the gating value of each selected expert is simply its score (zero for unselected experts).
In DeepSeek-V3, they switched to sigmoid and also added normalization of the selected scores before applying them: s_{i,t} = Sigmoid(u_t^T e_i), the top-K scores are kept, and the gating values are obtained by normalizing them to sum to one.
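A minimal numpy sketch of the difference between the two gating schemes for a single token (function names are mine; the real gate works on batches, and the shared expert bypasses routing entirely):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gate_v2_style(affinities, k):
    """DeepSeek-V2 style: softmax over affinities; gating values are the raw top-k scores."""
    s = softmax(affinities)
    top = np.argsort(s)[-k:]
    g = np.zeros_like(s)
    g[top] = s[top]
    return g

def gate_v3_style(affinities, k):
    """DeepSeek-V3 style: sigmoid affinities; the top-k scores are renormalized to sum to 1."""
    s = 1.0 / (1.0 + np.exp(-affinities))
    top = np.argsort(s)[-k:]
    g = np.zeros_like(s)
    g[top] = s[top] / s[top].sum()
    return g

affinities = np.random.randn(256)           # u_t . e_i for 256 routed experts
print(gate_v2_style(affinities, 8).sum())   # noticeably less than 1 in general
print(gate_v3_style(affinities, 8).sum())   # exactly 1
```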
To avoid routing collapse (for example, when everything is sent to the same experts), DeepSeek-V2 had special balancing losses: one at the expert level, another at the computational device level (plus a communication balance loss), which makes sense as balance is desired in all these places.
Too large an auxiliary loss can hurt model performance, so in DeepSeek-V3 they abandoned these losses in favor of an auxiliary-loss-free load balancing strategy, published by the team slightly earlier (https://arxiv.org/abs/2408.15664). In it, a per-expert bias is added to the affinity score during routing, and top-k is taken from the result. The bias is used only for selection; it doesn't enter the expert mixing coefficient (gating value).
A simple procedure controls the bias: expert loads within the batch are monitored, and if an expert is overloaded, its bias is lowered (and raised if it sits idle). This works better than the loss-based approach.
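Here's how I read that mechanism, as a rough numpy sketch (the update speed and variable names are my own placeholders, not the exact values from the balancing paper):

```python
import numpy as np

def route_with_bias(scores, bias, k=8):
    """Bias participates only in the top-k selection; gating uses the raw sigmoid scores."""
    chosen = np.argsort(scores + bias)[-k:]
    gates = np.zeros_like(scores)
    gates[chosen] = scores[chosen] / scores[chosen].sum()   # V3-style normalization
    return chosen, gates

def update_bias(bias, tokens_per_expert, gamma=0.001):
    """After each step: lower the bias of overloaded experts, raise it for underloaded ones."""
    return bias - gamma * np.sign(tokens_per_expert - tokens_per_expert.mean())
```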
Interesting move away from backprop. Although maybe they just haven't found the right approach for training with backprop...
To avoid imbalance within individual sequences, they also added a Complementary Sequence-Wise Auxiliary Loss with a very small weight. There's also an algorithmic Node-Limited Routing that restricts how many nodes a token can be routed to, conceptually similar to the device-limited routing in DeepSeek-V2. Each token is sent to at most 4 nodes.
Multi-Token Prediction (MTP)
Next, the new features. Multi-Token Prediction (MTP) is used. The idea of MTP is to predict more than one token at each position. In DeepSeek-V3 it's two tokens: the next one and the one after it. In theory, this strengthens the training signal and can improve data efficiency. It can also help the model better prepare for predicting future tokens.
The additional tokens are predicted sequentially, keeping the causal chain at each prediction depth. For predicting D additional tokens, D MTP modules are used; they share the embedding layer and output head with the main model. Each module takes the hidden state from the main model (or the previous MTP module) and the embedding of the next token, normalizes both with RMSNorm, concatenates them, projects them back down, and runs them through its own transformer block. Each module computes a cross-entropy loss; the losses are averaged across modules and added to the main objective with coefficient λ (0.3 for the first 10T tokens, 0.1 for the subsequent 4.8T). During inference, the MTP modules are discarded, but they can be used for speculative decoding.
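A schematic PyTorch sketch of one MTP module as I read the paper's description (the transformer block is abstracted away, the embedding and head are meant to be the ones shared with the main model, and all names here are mine):

```python
import torch
import torch.nn as nn

class MTPModule(nn.Module):
    """One MTP depth: mixes the previous depth's hidden state with the next token's embedding."""
    def __init__(self, d_model, shared_embedding, shared_head, transformer_block):
        super().__init__()
        self.embed, self.head, self.block = shared_embedding, shared_head, transformer_block
        self.norm_h = nn.RMSNorm(d_model)   # nn.RMSNorm needs a recent PyTorch
        self.norm_e = nn.RMSNorm(d_model)
        self.proj = nn.Linear(2 * d_model, d_model, bias=False)

    def forward(self, h_prev, next_token_ids):
        # RMSNorm both inputs, concatenate, project back to d_model, run a transformer block.
        x = torch.cat([self.norm_h(h_prev), self.norm_e(self.embed(next_token_ids))], dim=-1)
        h = self.block(self.proj(x))
        return h, self.head(h)   # hidden state for the next depth + logits for this depth's CE loss
```

Stacking D of these after the main model, averaging their cross-entropy losses, and scaling by λ gives the extra MTP term in the objective.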
MTP consistently improves performance on most benchmarks. In experiments, the acceptance rate for the second predicted token ranged from 85% to 90%; combined with speculative decoding, this increases TPS by roughly 1.8x.
Infrastructure
Another interesting part is the infrastructure.
DeepSeek-V3 was trained on a cluster of 2048 NVIDIA H800 GPUs. Reminder: the H800 is a cut-down H100 for the Chinese market. Its interconnect is reduced (NVLink bandwidth is more than halved, with fewer links), and FP64 FLOPS are cut by tens of times, which is unimportant for neural networks but worse for simulating atomic bombs. And to keep the numbering "logical", the H200 is an improved version of the H100 with more and faster memory.
For training, the company wrote their own closed-source framework HAI-LLM.
DeepSeek-V3 uses 16-way Pipeline Parallelism (PP), 64-way Expert Parallelism (EP) with 8 nodes, and ZeRO-1 Data Parallelism (DP). For efficient PP, they developed the DualPipe algorithm, overlapping communication and computation phases in forward and backward passes. This leads to reduced pipeline bubbles. Thanks to severe memory optimizations, they managed without Tensor Parallelism (TP). They also developed efficient cross-node all-to-all communication kernels.
FP8 Training
The most interesting part.
For those who don't know what FP32, FP16, BF16 are, welcome to my old post. FP8 isn't there, but by analogy, you'll understand what it is.
This seems to be the first really large open production model trained in FP8. Llama 3, for example, was apparently trained in BF16, and as I understand it this is roughly the standard, or a mix of FP32/16. Yes, there was an earlier work (https://arxiv.org/abs/2409.12517) from Israeli researchers at Habana (now Intel). They trained a 7B model on 2T tokens on Intel-Habana's Gaudi2 and achieved quality comparable to BF16 while improving throughput by 34%. There was also an even earlier FP8-LM (https://arxiv.org/abs/2310.18313) from Microsoft, where they trained GPT-175B. They even published a library (https://github.com/Azure/MS-AMP). I wouldn't be surprised if OpenAI has also internally switched to FP8 (maybe at least for some models), but they're silent about it. What Google is doing is also unclear. But I'm betting on BF16 🙂
In reality, DeepSeek also uses mixed precision: some things are still computed in higher-precision formats, BF16 or even FP32. These formats remain for the embedding module, the output head, MoE gating modules, normalization operators, and attention operators. Master weights, weight gradients, and optimizer states are also stored in higher precision. All this improves training stability, seemingly the main problem of low-precision formats (beyond the lack of support in kernels and hardware). But most of the heavy computation is in FP8.
Partly because of this, I think, they managed to save significantly on compute costs. In theory, FP8 doubles available compute throughput while roughly halving memory requirements.
Along the way, they implemented several strategies to improve precision: fine-grained quantization (tile-wise 1x128 for activations and block-wise 128x128 for weights), increased precision for accumulation, and prioritizing mantissa over exponent, thanks to which the E4M3 format (4 bits for exponent and 3 for mantissa) is used for all tensors rather than a mix of E4M3 and E5M2.
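A toy numpy illustration of the tile-wise scaling idea (no real FP8 here: the cast is crudely simulated, 448 is the largest finite E4M3 value, the 1x128 tile size follows the paper, and everything else is my own simplification):

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite value representable in E4M3
TILE = 128         # activations are scaled per 1x128 tile in the paper

def fake_e4m3(x):
    """Crude stand-in for an FP8 cast: keep ~3 mantissa bits, ignore exponent range and subnormals."""
    m, e = np.frexp(x)
    return np.ldexp(np.round(m * 16) / 16, e)

def quantize_tiles(x):
    """Each 1x128 tile gets its own scale, so one outlier only hurts its own tile."""
    x = x.reshape(-1, TILE)
    scales = np.abs(x).max(axis=1, keepdims=True) / E4M3_MAX
    return fake_e4m3(x / scales), scales

def dequantize_tiles(q, scales):
    return (q * scales).reshape(-1)

x = np.random.randn(4096).astype(np.float32)
x[7] = 300.0                               # a single large outlier
err = np.abs(dequantize_tiles(*quantize_tiles(x)) - x)
print(err.mean(), err.max())               # most of the error stays inside the outlier's tile
```

The point is that each tile carries its own scale, so a single outlier only degrades the precision of its own 128 values instead of the whole tensor.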
They also invested in storage and communication optimization, which helped save both in memory consumption and communication overhead.
FP8 training was validated on DeepSeek-V2 with 16B and 230B parameters, where the difference between FP8 and BF16 was within random variation.
Now we're waiting for America to require Nvidia to limit FP4 and FP8 🙂
Inference
They also made optimizations for inference.
The deployment of prefill and decode phases is separated. As a reminder, during prefill all prompt tokens are processed and all intermediate KVs are computed, while during decode autoregressive token-by-token generation occurs. More details here.
For prefill, the minimum deployment unit contains 4 nodes with 32 GPUs and its own parallelism settings. For decoding, where each token needs 9 active experts in total (8 routed plus 1 shared), the minimum unit contains 40 nodes with 320 GPUs and has its own settings.
Suggestions on Hardware Design
A separate interesting section is "3.5. Suggestions on Hardware Design."
I haven't seen similar sections in other papers, but maybe they exist somewhere. Please share good examples if you know any. This is really cool, the co-evolution of software and hardware in all its glory — now someone needs to implement it. In China, I think they will.
Among the recommendations, there are groups about communication and compute.
Communication took up 20 of the 132 SMs that could otherwise have been doing computation. The authors would like a GPU co-processor or a dedicated network co-processor to offload such tasks. Remember the Intel 386/387 era, when there were processors and arithmetic co-processors? Now GPUs are getting co-processors of their own! Although similar things have long existed, like DPUs? From a programming perspective, it would also be interesting to unify the InfiniBand (scale-out) and NVLink (scale-up) networks.
From the compute perspective, there are requests for increased accumulation precision within tensor cores, support for tile- and block-wise quantization, online quantization, and support for transposed GEMM operations.
I'll end the technical analysis here for now, maybe we'll go through the training and subsequent models later.