block parameter N
MoonshotAI/Attention-Residuals
Just curious: have you guys tried varying the block size within a single model (for example, smaller groups in earlier layers and larger groups in later layers)?