
GitHub Open Source MoonshotAI/Attention-Residuals

No tagline provided.

2,587
Traction Score
116
Forks
Mar 15, 2026
Launch Date

Product Positioning & Context

AI Executive Synthesis
Exploring and evaluating a novel, data-dependent query formulation for Attention Residuals to potentially enhance its representational power and dynamic routing capabilities, moving beyond static query vectors.
This issue proposes an alternative, data-dependent query formulation for Attention Residuals, moving beyond the current static query vector. The proposed method computes unnormalized routing scalars for future layers via an affine projection of $v_i$ at each layer, followed by softmax normalization and a sum reduction. This demonstrates active engagement with the core architectural design. For B2B SaaS companies developing foundational AI models, such theoretical explorations are critical to pushing performance boundaries: investigating dynamic, data-dependent routing mechanisms could unlock meaningful improvements in model efficiency, capacity, or generalization, offering a competitive edge in a rapidly evolving AI landscape.
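The routing mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `route_residuals`, the projection parameters `W` and `b`, and all tensor shapes are hypothetical and not taken from the repository; only the sequence of steps (affine projection of $v_i$ to unnormalized scalars, softmax normalization, sum reduction over layer outputs) follows the proposal.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route_residuals(v_i, layer_outputs, W, b):
    """Hypothetical sketch of the proposed data-dependent query.

    v_i:           (batch, d_model)    current layer's value vector
    layer_outputs: (batch, L, d_model) stored outputs of L layers
    W, b:          affine projection mapping v_i to L routing logits
    """
    # Affine projection of v_i -> unnormalized routing scalars, one per layer
    logits = v_i @ W + b                 # (batch, L)
    # Softmax normalization over the layer dimension
    weights = softmax(logits, axis=-1)   # (batch, L)
    # Sum reduction: routing-weighted combination of layer outputs
    return np.einsum("bl,bld->bd", weights, layer_outputs)
```

Because the logits are a function of $v_i$, the mixing weights change per token and per input, in contrast to a static query vector that yields the same routing pattern regardless of the data.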

Active Developer Issues (GitHub)

open Featured Proposal: Supervisory Interface for Long-Horizon Interaction - Empirical Evidence from 180-Day LSO Trace
Logged: Mar 28, 2026
open block parameter N
Logged: Mar 25, 2026
open Talk is cheap. Show me the code.
Logged: Mar 19, 2026
open Group photo
Logged: Mar 19, 2026
open Is there implementation code available?
Logged: Mar 18, 2026

Community Voice & Feedback

yisar • Mar 31, 2026
I just posted one example offhand. The gist is: there are already papers on applying attention to residual connections, and their results are generally not great, yet this work doesn't mention them at all. The earlier linear-layer paper was likewise nearly identical to RWKV, and reading the two together, the authors have clearly read similar prior work.
cho104 • Mar 31, 2026
I’m a bit confused by the flow of this thread.

The OP originally linked to the "DeepCrossAttention paper" (published Feb 10, 2025). Since that paper's concepts seem very closely related to this repository's "Attention Residuals" (published Mar 16, 2026), the OP's concern about a missing citation feels completely valid. (similar concern on X: https://x.com/behrouz_ali/status/2033581834953453853 )

The response, however, seems to focus on comparing Attention Residuals to Hyper Connection, bypassing DeepCrossAttention. I checked the edit history, and it doesn't look like the OP ever changed their URL...

Was there a mix-up when reading the initial issue?
xxyh1993 • Mar 31, 2026
Huh? Are we not reading the same technical report?
chuanyang-Zheng • Mar 17, 2026
> https://arxiv.org/abs/2502.06785 is almost identical to this paper, yet the paper doesn't mention it at all. The same thing happened before: [MoonshotAI/Kimi-Linear#4](https://github.com/MoonshotAI/Kimi-Linear/issues/4)


Attention Residual is quadratic attention over the layer dimension; Hyper Connection is linear attention over the layer dimension. There has been prior work on this layer-dimension idea, e.g. [Learning Deep Transformer Models for Machine Translation](https://arxiv.org/abs/1906.01787) and [Depth-Wise Attention (DWAtt)](https://arxiv.org/abs/2209.15168). Compared with the linear formulation, Attention-Residual-style work may show a clear improvement over linear attention (e.g. Hyper Connection) when the number of layers is large (e.g. more than 1000).

The information that attention propagates between layers is essentially performing gradient descent: Pre-Norm and Post-Norm are optimization on a manifold, where x is the current point and ffn(x) or att(x) is the corresponding gradient, so many interesting things can be done from this view. Understood this way, Attention Residual is doing something like SGD with momentum. We can also model the information flow as gradient descent on a manifold: [GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization](https://arxiv.org/abs/2601.22095)

I work mostly on architecture, so feel free to reach out with questions.
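The quadratic-vs-linear distinction drawn in the comment above can be sketched as follows. This is an illustrative contrast only, under stated assumptions: the function names and shapes are hypothetical, and neither function is the actual Attention Residuals or Hyper Connection implementation; the point is the cost profile over the layer dimension.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def quadratic_layer_attention(hidden_states, query):
    """Attention-Residual-style mixing (illustrative): the current layer
    attends with softmax over ALL stored layer outputs. Each of the L
    layers scores L states, so total cost grows quadratically in depth.

    hidden_states: (L, d) outputs of previous layers; query: (d,)
    """
    scores = hidden_states @ query               # (L,) one score per layer
    weights = softmax(scores[None, :])[0]        # (L,) softmax over layers
    return weights @ hidden_states               # (d,) attended mixture

def linear_layer_mixing(hidden_states, weights):
    """Hyper-Connection-style mixing (illustrative): a linear combination
    of layer outputs with no softmax over depth, maintainable as a
    running sum, so cost grows only linearly in depth."""
    return weights @ hidden_states               # (d,)
```

The quadratic form re-normalizes over all previous layers at every step, while the linear form composes incrementally; the comment's claim is that the extra expressiveness of the former may only pay off at very large depths.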

Related Early-Stage Discoveries

Discovery Source

GitHub Open Source

Aggregated via automated community intelligence tracking.

Tech Stack Dependencies

No direct open-source NPM package mentions detected in the product documentation.

Media Tractions & Mentions

No mainstream media stories specifically mentioning this product name have been intercepted yet.

Deep Research & Science

No direct peer-reviewed scientific literature matched with this product's architecture.