But there are a few really interesting and new insights:
With transformer models above 6.7B parameters, a "phase shift" (their language) occurs where features (a feature being a dimension that "offers some weak explanation for the label") are shared between layers, in that all the layers agree on which dimension to use for a given feature.
This is really important because these key features are where the "knowledge" of the neural network is concentrated. The attention layers are very sparse ("Almost all sequence dimensions have zero probability.")
But the fully connected layers are very dense. The author compares this to computer vision, where fully connected layers can be pruned of 95% of their weights without serious impact, while a transformer past this 6.7B-parameter point can only be pruned of about 5% of its weights.
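For concreteness, pruning here presumably means magnitude pruning: zero out everything but the largest-magnitude weights. A minimal NumPy sketch (my illustration, not anything from the paper):

```python
import numpy as np

def magnitude_prune(W, keep=0.05):
    """Zero all but the largest-magnitude `keep` fraction of W's entries."""
    k = max(int(W.size * keep), 1)
    cutoff = np.sort(np.abs(W), axis=None)[-k]   # k-th largest magnitude
    return np.where(np.abs(W) >= cutoff, W, 0.0)
```

The claim is then that a CNN's fully connected layers survive `magnitude_prune(W, keep=0.05)` with little accuracy loss, while a large transformer only tolerates something like `keep=0.95`.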
And this is really interesting:
> Transformers become more stable. If you treat the outlier features separately, I believe you can probably run and even train transformers in less than 8-bit precision without degradation in performance.
The possibility of training networks with hundreds of billions of parameters in 8-bit (or less!) precision would be a real breakthrough.
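To make "treat the outlier features separately" concrete, here's a minimal NumPy sketch in the spirit of LLM.int8()'s mixed-precision decomposition: the few feature dimensions carrying large magnitudes stay in fp16, everything else goes through int8. The threshold of 6.0 matches the paper's outlier criterion, but the per-tensor scaling is a simplification of mine (the paper quantizes per row and column):

```python
import numpy as np

def mixed_precision_matmul(X, W, threshold=6.0):
    """X @ W with outlier feature dimensions in fp16 and the rest in int8."""
    outlier = np.abs(X).max(axis=0) > threshold      # feature dims with big values
    regular = ~outlier

    # Outlier part: plain fp16 matmul, no quantization error for these dims.
    out_hi = X[:, outlier].astype(np.float16) @ W[outlier, :].astype(np.float16)

    # Regular part: symmetric int8 quantization with one scale per tensor.
    Xr, Wr = X[:, regular], W[regular, :]
    if Xr.size == 0:                                 # degenerate case: all outliers
        return out_hi.astype(np.float32)
    sx = max(np.abs(Xr).max() / 127, 1e-12)
    sw = max(np.abs(Wr).max() / 127, 1e-12)
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)
    # Accumulate in int32 to avoid int8 overflow, then rescale.
    out_lo = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)

    return out_hi.astype(np.float32) + out_lo.astype(np.float32)
```

Because the outlier columns are so few, almost all of the multiply-accumulate work still happens in int8.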
 for example, increased the number 10-fold. Papers like  have pushed the complexity per synapse much higher still (so much so that they don't even put a number on it).
I'm a software engineer with a passing interest in this stuff, and my eyes glazed over.
> The only way to improve quantization is through more normalization constants. A normalization constant squishes the input distribution, for example, I5, into the target distribution, for example, I3. We can increase precision, by squishing each vector only as much as is needed. For example, if you have the two vectors:
> [3, 1, 2, 3]
> [0, 2, 2, 0]
> Then you can squish the first by 4 and the second by 2. This will give you twice the precision to quantize the second vector because the inputs are now spread over a broader range of the I3 data type.
If you had squished both by 4, the second vector would be [0, 0.5, 0.5, 0], leaving essentially half the quantization space (0.5-1.0) unused, and leaving you with less precision.
Is this what the author means by "This will give you twice the precision to quantize the second vector because the inputs are now spread over a broader range of the I3 data type"?
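If that reading is right, the arithmetic is easy to check. A toy NumPy sketch, where I'm treating "I3" as the integer grid 0..3 (my assumption about the post's made-up data type):

```python
import numpy as np

GRID = 3  # toy "I3": representable values are the integers 0..3

def roundtrip(v, c):
    """Squish v by normalization constant c, round onto the I3 grid,
    then dequantize back."""
    q = np.round(np.asarray(v, float) / c * GRID)    # squish + round
    return q / GRID * c                              # stretch back out

b = np.array([0, 2, 2, 0])

print(b / 4)            # [0, 0.5, 0.5, 0]: only half of [0, 1] is used
print(b / 2)            # [0, 1, 1, 0]: the full range is used
print(roundtrip(b, 4))  # [0, 2.67, 2.67, 0]: rounding error
print(roundtrip(b, 2))  # [0, 2, 2, 0]: recovered exactly
```

With its own constant, b's values land exactly on grid points; with the shared constant, half the grid goes unused and the representable steps are twice as coarse.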
> Let’s do an example. Let’s say we have the vector [3, 1, 2, 3] in I5, and we want to quantize to I3.
> We see that our dequantization and quantization led to two errors:
> [3, 1, 2, 4] to [3, 0, 2, 3]
The author changed [3, 1, 2, 3] to [3, 1, 2, 4].
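The excerpt doesn't fully specify the I5 and I3 types, so the linear round trip below is an assumption of mine and won't reproduce the post's exact output ([3, 0, 2, 3]); but it shows the same effect, including that the error count depends on which vector you start from:

```python
import numpy as np

def roundtrip(v, src=5, dst=3):
    """Quantize an I5 vector (integers 0..5) down to I3 (0..3) and back,
    rounding in both directions."""
    q = np.round(np.asarray(v, float) * dst / src)   # I5 -> I3
    return np.round(q * src / dst).astype(int)       # I3 -> I5

for v in ([3, 1, 2, 3], [3, 1, 2, 4]):
    back = roundtrip(v)
    errors = int((back != np.asarray(v)).sum())
    print(v, "->", back.tolist(), f"({errors} error(s))")
```

With the vector as originally stated I get one error; with the changed vector, two, which matches the error count in the post.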