Interesting. If you compare only the model-layer parameters against performance, do any different trends appear? Thank you for responding.
Can you discuss the other contributors to parameter count that your article didn't mention, such as embedding/head size and intermediate dimensions? I'm struggling to see how a 768x4 model has the same parameter count as a 512x12 one.