Considering a different formulation

MoonshotAI/Attention-Residuals

Status: Open

Opened: Mar 16, 2026

Hello 😀 I was reading your paper and came up with an idea for an alternative formulation that I would like to see explored. Your formulation uses a static query vector rather than a truly data-dependent query. Why not go all in on this?

In the alternative formulation, each layer $i$ computes unnormalized routing scalars for all future layers $l \in \{i+1, \dots, L\}$ via an affine projection of $v_i$:

$$s_i = W^{(i)} v_i + b^{(i)}$$

where $W^{(i)} \in \mathbb{R}^{(L-i) \times d}$ is the projection weight matrix and $b^{(i)} \in \mathbb{R}^{L-i}$ is the bias vector. The entry of $s_i$ addressed to layer $l$ is written $s_{i \to l}$. For example, layer 3 explicitly calculates its own contribution weights for layers 4, 5, ..., $L$.

When computing the input $h_l$ for a subsequent layer $l$, collect the scalar $s_{i \to l}$ provided by each earlier layer $i \in \{0, \dots, l-1\}$. Apply a softmax over these $l$ scalars so that the earlier layers compete for weight, then compute the final representation as a weighted sum over the past values:

$$\alpha_{i \to l} = \frac{\exp(s_{i \to l})}{\sum_{j=0}^{l-1} \exp(s_{j \to l})}$$

$$h_l = \sum_{i=0}^{l-1} \alpha_{i \to l} v_i$$
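To make the indexing concrete, here is a minimal NumPy sketch of the proposal for a single token. It is only an illustration under a few assumptions that are not stated above: $v_0$ is taken to be the embedding, $v_l$ for $l \ge 1$ is taken to be the output of layer $l$ applied to $h_l$, and the names `init_routing_params`, `routing_scores`, `mix_input`, `forward`, and `layer_fn` are all hypothetical. Row $k$ of $W^{(i)}$ is assumed to address target layer $l = i + 1 + k$, so $s_{i \to l}$ is `scores[i][l - i - 1]`.

```python
import numpy as np

def init_routing_params(L, d, rng):
    """Hypothetical initializer: one (W, b) pair per source layer i = 0..L-1.

    W[i] has shape (L - i, d): one row per future layer l in {i+1, ..., L}.
    """
    Ws = [rng.normal(scale=d ** -0.5, size=(L - i, d)) for i in range(L)]
    bs = [np.zeros(L - i) for i in range(L)]
    return Ws, bs

def routing_scores(v_i, W_i, b_i):
    """s_i = W^(i) v_i + b^(i): unnormalized scalars for all future layers."""
    return W_i @ v_i + b_i

def mix_input(l, values, scores):
    """Compute h_l and the weights alpha_{i -> l} from v_0..v_{l-1}."""
    # Gather s_{i -> l} from every earlier layer i; it sits at offset l - i - 1.
    s = np.array([scores[i][l - i - 1] for i in range(l)])
    s = s - s.max()                      # shift for numerical stability
    alpha = np.exp(s) / np.exp(s).sum()  # softmax over the l earlier layers
    h = alpha @ np.stack(values[:l])     # weighted sum of past values
    return h, alpha

def forward(v0, Ws, bs, layer_fn):
    """Run the routing for layers 1..L; layer_fn stands in for a real block."""
    L = len(Ws)
    values = [v0]
    scores = [routing_scores(v0, Ws[0], bs[0])]
    inputs = []
    for l in range(1, L + 1):
        h_l, _ = mix_input(l, values, scores)
        inputs.append(h_l)
        if l < L:
            # Assumption: v_l is the output of layer l applied to h_l.
            v_l = layer_fn(h_l)
            values.append(v_l)
            scores.append(routing_scores(v_l, Ws[l], bs[l]))
    return inputs

rng = np.random.default_rng(0)
Ws, bs = init_routing_params(L=4, d=8, rng=rng)
inputs = forward(rng.normal(size=8), Ws, bs, layer_fn=np.tanh)
```

Note that layer 1 has only $v_0$ to choose from, so its softmax is trivially 1 and $h_1 = v_0$. The routing only becomes data-dependent from layer 2 onward.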

