Great paper, thanks for sharing it and your thoughts, Grisha.
The sad thing is that we still look at softmax merely as a "handy normalizer tool".
Softmax has much deeper significance. It is a generalization of the logistic function. So, when we use softmax, we unwittingly delve (ChatGPT, ha?) into these big things:
1. We treat the data as coming from a multinomial (categorical) distribution.
2. We sculpt and chisel the neural network so that during training it acts as a system of differential equations, more specifically as a replicator dynamics system (see the replicator equation, sketched below), and also as:
3. ...the Gibbs-Boltzmann distribution from statistical physics (where temperature is so natural, right? see the small code sketch below)
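A minimal sketch of points 1 and 3, plus the logistic claim above, assuming nothing beyond NumPy; the numbers and the `temperature` knob are purely illustrative:

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Gibbs-Boltzmann form: p_i proportional to exp(z_i / T)."""
    z = np.asarray(z, dtype=float) / temperature
    z -= z.max()                      # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Two classes: softmax over logits [z, 0] is exactly the logistic function of z,
# which is the sense in which softmax generalizes the logistic.
z = 1.7
print(softmax([z, 0.0])[0])   # ~0.8455
print(sigmoid(z))             # ~0.8455, same value

# Temperature behaves as in statistical physics:
# T -> 0 sharpens toward argmax, T -> infinity flattens toward uniform.
logits = [2.0, 1.0, 0.1]
print(softmax(logits, temperature=0.1))   # close to one-hot
print(softmax(logits, temperature=10.0))  # close to uniform
```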
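And for point 2, a back-of-the-envelope sketch of where the replicator equation shows up (the "fitness" notation f_i is mine, just to make the correspondence explicit): if the logits z_i move in continuous time during training, the softmax probabilities obey

```latex
% Softmax: p_i = e^{z_i} / \sum_j e^{z_j}.
% Suppose the logits evolve in continuous time with "fitness" f_i, i.e. \dot z_i = f_i.
% Differentiating p_i with respect to time gives
\[
  \dot p_i \;=\; p_i\,\dot z_i \;-\; p_i \sum_j p_j\,\dot z_j
           \;=\; p_i\Bigl(f_i - \sum_j p_j f_j\Bigr),
\]
% which is exactly the replicator equation: p_i grows whenever its fitness f_i
% beats the population-average fitness \sum_j p_j f_j.
```

Under plain gradient flow on the logits, f_i is just the negative gradient of the loss with respect to z_i, so the output probabilities literally play an evolutionary game during training.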
Last but not least, it is sad that we still bind ourselves to the "train, then deliver an inference-only model" workflow. We are still far away from open-endedness.