Training a 1 trillion parameter model
This episode discusses the history of Kimi K2 and Moonshot AI, avoiding loss spikes during training, the Muon optimizer, and data parallelism, all in the context of training a 1-trillion-parameter model.