Causal Self Attention implementation and auto-grading correctness.
GitHub Issue
Apr 5, 2026
Both of the implementations below pass all the checks for problem 9, Causal Self Attention:
```
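import math
import torch

def causal_attention(Q, K, V):
    B, seq, d_k = Q.size()
    # Lower-triangular mask: 1 where a query may attend, 0 for future positions.
    mask = 1 - torch.triu(torch.ones(seq, seq), diagonal=1)
    # Scale by sqrt(d_k), then zero out the future positions.
    scores = (torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_k)) * mask
    # Turn zeroed entries into -inf so softmax gives them zero weight.
    scores[scores == 0] = float('-inf')
    attention = torch.bmm(torch.softmax(scores, dim=-1), V)
    return attention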
```
```
import torch

def causal_attention(Q, K, V):
    B, seq, d_k = Q.size()
    mask = 1 - torch.triu(torch.ones(seq, seq), diagonal=1)
    # Scales by d_k rather than sqrt(d_k); this is the only change.
    scores = (torch.bmm(Q, K.transpose(1, 2)) / d_k) * mask
    scores[scores == 0] = float('-inf')
    attention = torch.bmm(torch.softmax(scores, dim=-1), V)
    return attention
```
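The only difference between the two is the scaling factor: the first divides the scores by `math.sqrt(d_k)`, the second by `d_k`. Standard scaled dot-product attention uses `sqrt(d_k)`, so the second version should not pass.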
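The issue does not say how the checks are implemented, so the following is only a sketch of one numerical check that would tell the two scalings apart. All names here (`causal_attention_with_scale`, `reference`, `wrong_scale`, `passes_check`) are hypothetical and not part of TorchCode. The sketch also swaps the multiply-then-test-for-zero masking for `masked_fill`, which is an assumption about what a reference solution might do.

```python
import math
import torch


def causal_attention_with_scale(Q, K, V, scale):
    """Causal attention with an explicit divisor for the scores (hypothetical helper)."""
    seq = Q.shape[1]
    scores = torch.bmm(Q, K.transpose(1, 2)) / scale
    # True above the diagonal marks the future positions to hide.
    future = torch.triu(
        torch.ones(seq, seq, dtype=torch.bool, device=Q.device), diagonal=1
    )
    scores = scores.masked_fill(future, float("-inf"))
    return torch.bmm(torch.softmax(scores, dim=-1), V)


def reference(Q, K, V):
    # Standard scaling: divide by sqrt(d_k).
    return causal_attention_with_scale(Q, K, V, math.sqrt(Q.shape[-1]))


def wrong_scale(Q, K, V):
    # The variant from the issue: divide by d_k.
    return causal_attention_with_scale(Q, K, V, Q.shape[-1])


def passes_check(candidate, d_k, seed=0):
    """Compare a candidate against the reference on seeded random inputs."""
    gen = torch.Generator().manual_seed(seed)
    Q = torch.randn(2, 4, d_k, generator=gen)
    K = torch.randn(2, 4, d_k, generator=gen)
    V = torch.randn(2, 4, d_k, generator=gen)
    return torch.allclose(candidate(Q, K, V), reference(Q, K, V), atol=1e-5)
```

With `d_k = 1` the two divisors are equal, so `passes_check(wrong_scale, d_k=1)` cannot separate the versions; a larger value such as `d_k = 16` can. Using `masked_fill` also avoids a side effect of the `scores[scores == 0]` approach, which sets any legitimately zero score in the visible region to `-inf` as well.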
Repository: duoan/TorchCode