Answer to: Decoder only model AI making repetitive responses
Score: 2 • Accepted
I think the problem is that you’ve built a decoder layer with cross-attention to itself and you’re training on pad tokens. Both drive repetition.
1) Don’t use TransformerDecoder with memory=x
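nn.TransformerDecoderLayer is built for encoder-decoder models: after masked self-attention it runs cross-attention over an encoder memory. You’re passing memory=x (the same sequence), so every block self-attends and then cross-attends to the same tokens. tgt_mask only covers the self-attention; with no memory_mask, the cross-attention lets each position read the whole sequence, including the tokens it is supposed to predict. The model can then learn to copy instead of predict, which encourages echoing at generation time. For GPT-style decoder-only, you want masked self-attention only.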
Your forward does this now:
mask = torch.triu(torch.ones(t, t, device=x.device) * float('-inf'), diagonal=1)
out = self.decoder(tgt=x, memory=x, tgt_mask=mask) # <- cross-attends to itself
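Switch to an encoder stack with a causal mask, or write a small GPT block. nn.TransformerEncoderLayer contains only self-attention and a feed-forward block, so with a causal mask it behaves like a GPT block. Easiest fix using the built-in encoder: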
import math
import torch
import torch.nn as nn

class CausalTransformer(nn.Module):
    def __init__(self, vocab_size, block_size, d_model, n_heads, n_layers, dropout):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # learned positional embeddings, one vector per position up to block_size
        self.pos_emb = nn.Parameter(torch.zeros(1, block_size, d_model))
        self.drop = nn.Dropout(dropout)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4*d_model,
            dropout=dropout, activation='gelu', batch_first=True
        )
        self.enc = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.ln = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.block_size = block_size

    def forward(self, x, key_padding_mask=None):
        b, t = x.size()  # t must not exceed block_size
        tok = self.tok_emb(x) * math.sqrt(self.tok_emb.embedding_dim)
        pos = self.pos_emb[:, :t, :]
        h = self.drop(tok + pos)
        # causal mask: True above the diagonal means "may not attend";
        # bool dtype matches key_padding_mask and avoids a mixed-mask-type warning
        causal = torch.triu(torch.ones(t, t, device=x.device), diagonal=1).bool()
        h = self.enc(h, mask=causal, src_key_padding_mask=key_padding_mask)
        h = self.ln(h)
        return self.head(h)  # logits, shape [B, T, vocab_size]
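A quick way to convince yourself that the mask really is causal is to change only the last input token and check that the outputs at earlier positions do not move. The snippet below is a minimal sketch of that check on a tiny stand-in encoder stack (not the full model above); the sizes and the helper name `run` are made up for illustration, and dropout is set to 0.0 so the two passes are comparable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for the encoder stack (hypothetical sizes, dropout disabled)
d_model, n_heads, vocab_size, t = 16, 2, 11, 6
emb = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
    dropout=0.0, batch_first=True,
)
enc = nn.TransformerEncoder(layer, num_layers=2)

def run(ids):
    # True above the diagonal blocks attention to future positions
    causal = torch.triu(torch.ones(t, t), diagonal=1).bool()
    return enc(emb(ids), mask=causal)

x1 = torch.randint(0, vocab_size, (1, t))
x2 = x1.clone()
x2[0, -1] = (x1[0, -1] + 1) % vocab_size  # change only the last token

with torch.no_grad():
    h1, h2 = run(x1), run(x2)

# Earlier positions must be unaffected; the last position must differ
prefix_unchanged = torch.allclose(h1[:, :-1], h2[:, :-1], atol=1e-5)
last_changed = not torch.allclose(h1[:, -1], h2[:, -1], atol=1e-5)
print(prefix_unchanged, last_changed)
```

If `prefix_unchanged` ever comes out False, some path in the model is leaking future tokens into earlier positions.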
2) Don’t learn “pad → pad”
You pad short chunks and then compute loss on every position, including pads. That trains the model to output <pad> repeatedly.
Fix the loss:
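pad_id = tokenizer.token_to_id('<pad>')  # returns None if '<pad>' is not in the vocab
# fail loudly instead of silently falling back to id 0, which may be a real token
assert pad_id is not None, "add '<pad>' to the tokenizer's special tokens"
criterion = nn.CrossEntropyLoss(ignore_index=pad_id)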
And ideally pass a key padding mask so attention doesn’t attend to pads:
# build mask: True where padding is present
key_padding_mask = (xb == pad_id) # shape [B, T]
logits = model(xb, key_padding_mask=key_padding_mask)
loss = criterion(logits.view(-1, vocab_size), yb.view(-1))
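To see what ignore_index changes, here is a small self-contained sketch with random logits and a target row whose last two entries are padding. The values of `pad_id` and `vocab_size` are placeholders, not taken from your tokenizer. It checks two things: the masked loss equals the plain loss computed over the real targets only, and padded positions receive no gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pad_id, vocab_size = 0, 5  # hypothetical values for illustration

logits = torch.randn(1, 4, vocab_size, requires_grad=True)  # [B, T, V]
yb = torch.tensor([[3, 1, pad_id, pad_id]])                 # last two targets are padding

plain = nn.CrossEntropyLoss()
masked = nn.CrossEntropyLoss(ignore_index=pad_id)

# Loss over all four positions, pads included
loss_all = plain(logits.view(-1, vocab_size), yb.view(-1))
# Loss with pad targets skipped
loss_real = masked(logits.view(-1, vocab_size), yb.view(-1))
# Same thing by hand: average over the two real targets only
manual = plain(logits[:, :2].reshape(-1, vocab_size), yb[:, :2].reshape(-1))

loss_real.backward()
# Ignored positions contribute nothing to the gradient
pad_grad_is_zero = bool((logits.grad[0, 2:] == 0).all())
print(loss_all.item(), loss_real.item(), manual.item(), pad_grad_is_zero)
```

Without ignore_index, the two pad targets are averaged into the loss like any other token, which is exactly the “pad → pad” signal described above.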
3) Sampling tweaks reduce looping when the model is uncertain
During generation, add top-k/top-p and a mild repetition penalty:
@torch.inference_mode()
def generate(model, tokenizer, prompt, max_new_tokens=100, temperature=0.9,
             top_k=50, top_p=0.95, rep_penalty=1.1, device='cpu'):
    model.eval().to(device)
    ids = torch.tensor([tokenizer.encode(prompt).ids], dtype=torch.long, device=device)
    for _ in range(max_new_tokens):
        # crop the context so it never exceeds the positional embedding table
        ctx = ids[:, -model.block_size:]
        logits = model(ctx)[:, -1, :] / max(1e-6, temperature)
        # repetition penalty: lower the logits of tokens already generated
        # (divide positive logits, multiply negative ones, so both move down)
        for b in range(ids.size(0)):
            seen = logits[b, ids[b]]
            logits[b, ids[b]] = torch.where(seen > 0, seen / rep_penalty, seen * rep_penalty)
        # top-k / top-p filtering
        probs = torch.softmax(logits, dim=-1)
        if top_k is not None:
            k = min(top_k, probs.size(-1))  # top_k may exceed the vocab size
            kth = torch.topk(probs, k=k)[0][:, -1].unsqueeze(-1)
            probs = torch.where(probs < kth, torch.zeros_like(probs), probs)
        if top_p is not None:
            sorted_probs, sorted_idx = torch.sort(probs, descending=True)
            cumsum = torch.cumsum(sorted_probs, dim=-1)
            # drop a token only if the mass before it already exceeds top_p,
            # so the token that crosses the threshold is kept
            mask = (cumsum - sorted_probs) > top_p
            sorted_probs[mask] = 0.0
            probs = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
        probs = probs / probs.sum(dim=-1, keepdim=True)
        next_tok = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_tok], dim=1)
    return tokenizer.decode(ids[0].tolist())
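The top-p step is the easiest one to get subtly wrong, so here is a worked sketch on a made-up five-token distribution (the numbers are invented for illustration). The cutoff can be drawn in two ways: variant A drops every token whose cumulative mass exceeds top_p, while variant B also keeps the token that crosses the threshold.

```python
import torch

# Toy next-token distribution (made-up numbers), deliberately unsorted
probs = torch.tensor([[0.15, 0.50, 0.01, 0.30, 0.04]])
top_p = 0.9

sorted_probs, sorted_idx = torch.sort(probs, descending=True)
cumsum = torch.cumsum(sorted_probs, dim=-1)  # 0.50, 0.80, 0.95, 0.99, 1.00

# Variant A: drop every token whose cumulative mass exceeds top_p
drop_a = cumsum > top_p
drop_a[..., 0] = False  # always keep the most likely token

# Variant B: drop a token only if the mass before it already exceeds top_p
drop_b = (cumsum - sorted_probs) > top_p

kept_a = int((~drop_a).sum())
kept_b = int((~drop_b).sum())

# Zero the dropped tokens, restore the original order, renormalize (variant B)
filtered = sorted_probs.masked_fill(drop_b, 0.0)
filtered = torch.zeros_like(probs).scatter(-1, sorted_idx, filtered)
filtered = filtered / filtered.sum(dim=-1, keepdim=True)
print(kept_a, kept_b, filtered)
```

Here variant A keeps two tokens (mass 0.80) and variant B keeps three (mass 0.95). It is worth checking which variant a sampler uses, since A can keep noticeably less than top_p of the probability mass.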
Once you switch to a causal encoder stack (or a custom GPT block), mask pads in loss/attention, and add sane sampling, the “anarchism anarchism …” loops should disappear.
Question
Parent Entity
Score: 2 • Views: 99
Site: stackoverflow