Mixture of Experts (MoE)

An architecture in which the model contains many specialist sub-networks (experts) but routes each token through only a few of them. DeepSeek-V3, for example, has 671B total parameters but activates only about 37B per token. The result: training and inference cost close to a small dense model, with the knowledge capacity of a much larger one. DeepSeek-V3 and Mixtral 8x22B are MoE models.
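
A minimal sketch of the routing idea in PyTorch. The layer sizes, expert count, and `top_k` value here are illustrative assumptions, not taken from any particular model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    # Hypothetical sizes for illustration only.
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: scores each token against every expert.
        self.router = nn.Linear(d_model, num_experts)
        # Experts: independent feed-forward sub-networks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x)                         # (num_tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)            # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts; the rest stay idle,
        # which is why active parameters per token are a small fraction of the total.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

With num_experts=8 and top_k=2, each token touches only a quarter of the expert parameters. Production MoE models also train the router with a load-balancing objective so tokens don't collapse onto a handful of experts.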
