Alternative formulation for Attention Residuals, specifically a data-dependent query mechanism.
Raw Developer Origin & Technical Request
GitHub Issue
Mar 16, 2026
Hello 😀
I was reading your paper and came up with an idea for an alternative formulation I would like to see.
Your formulation uses a static query vector rather than a truly data-dependent query.
Why not go all in on this?
In this alternative formulation, at each layer $i$, calculate the unnormalized routing scalars for all future layers $l \in \{i+1, \dots, L\}$ via an affine projection of $v_i$:
$$s_i = W^{(i)} v_i + b^{(i)}$$
where $W^{(i)} \in \mathbb{R}^{(L-i) \times d}$ is the projection weight matrix and $b^{(i)} \in \mathbb{R}^{L-i}$ is the bias vector. For example, layer 3 explicitly calculates its specific contribution weights for layers 4, 5, ..., $L$.
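As a shape check, the per-layer projection can be sketched in numpy; the sizes here (`L`, `d`, `i`) are hypothetical and only illustrate the dimensions stated above:

```python
import numpy as np

L, d, i = 6, 16, 3                      # hypothetical sizes: L layers, width d, current layer i
rng = np.random.default_rng(0)
v_i = rng.standard_normal(d)            # value produced by layer i
W_i = rng.standard_normal((L - i, d))   # W^{(i)} ∈ R^{(L-i) × d}
b_i = np.zeros(L - i)                   # b^{(i)} ∈ R^{L-i}
s_i = W_i @ v_i + b_i                   # one unnormalized routing scalar per future layer
assert s_i.shape == (L - i,)            # layer 3 scores each of layers 4, 5, ..., L
```

Each entry of `s_i` is the scalar $s_{i \to l}$ that layer $i$ offers to one future layer $l$.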
When computing the input $h_l$ for a subsequent layer $l$, collect the scalar $s_{i \to l}$ provided by each earlier layer $i \in \{0, \dots, l-1\}$. Apply a softmax over these $l$ scalars to ensure competitive normalization, and compute the final representation via a sum reduction over the past values:
$$\alpha_{i \to l} = \frac{\exp(s_{i \to l})}{\sum_{j=0}^{l-1} \exp(s_{j \to l})}$$
$$h_l = \sum_{i=0}^{l-1} \alpha_{i \to l} v_i$$
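Putting the two steps together, here is a minimal numpy sketch of the full mechanism. Class and method names are invented for illustration, layers are indexed `0..L-1`, and the weights are randomly initialized stand-ins; this is a sketch of the proposal, not the repository's implementation:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class DataDependentResidualRouter:
    """Sketch of the proposed routing: layer i projects its value v_i into
    one unnormalized scalar per future layer, and each layer l softmaxes
    the scalars it receives before mixing the past values."""

    def __init__(self, L, d, seed=0):
        rng = np.random.default_rng(seed)
        self.L, self.d = L, d
        # W[i] has shape (L - 1 - i, d): layer i scores every future layer i+1..L-1.
        self.W = [rng.standard_normal((L - 1 - i, d)) * 0.02 for i in range(L - 1)]
        self.b = [np.zeros(L - 1 - i) for i in range(L - 1)]

    def route(self, values):
        """values: [v_0, ..., v_{l-1}]; returns the input h_l for layer l."""
        l = len(values)
        # s_{i -> l} is row (l - 1 - i) of layer i's score matrix applied to v_i.
        s = np.array([self.W[i][l - 1 - i] @ values[i] + self.b[i][l - 1 - i]
                      for i in range(l)])
        alpha = softmax(s)  # competitive normalization over the l scalars
        return sum(a * v for a, v in zip(alpha, values))
```

Because the weights `alpha` sum to one, each `h_l` is a convex combination of the past values, which is what the softmax normalization buys over raw scalars:

```python
router = DataDependentResidualRouter(L=4, d=8)
vals = [np.full(8, float(k)) for k in range(3)]  # v_0, v_1, v_2
h3 = router.route(vals)                          # lies between min and max of the v_i
```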
Developer Debate & Comments
No active discussions extracted for this entry yet.
Adjacent Repository Pain Points
Other highly discussed features and pain points extracted from MoonshotAI/Attention-Residuals.