
Answer to: Decoder only model AI making repetitive responses

Score: 2 • Accepted
Answered: Oct 30, 2025
User Rep: 26
I think the problem is that you’ve built a decoder layer with cross-attention to itself and you’re training on pad tokens. Both drive repetition.

1) Don’t use TransformerDecoder with memory=x

A decoder layer in PyTorch expects cross-attention to an encoder memory. You’re passing `memory=x` (the same sequence), so every block does self-attention and then cross-attention to the same tokens. The `tgt_mask` only covers the self-attention; with no `memory_mask`, the cross-attention is unmasked, so each position can see later tokens during training. That shortcut isn’t available at generation time, which encourages echoing. For GPT-style decoder-only models, you want masked self-attention only.

Your forward does this now:

```python
mask = torch.triu(torch.ones(t, t, device=x.device) * float('-inf'), diagonal=1)
out = self.decoder(tgt=x, memory=x, tgt_mask=mask)  # <- cross-attends to itself
```

Switch to an encoder stack with a causal mask, or write a small GPT block. Easiest fix using the built-in encoder:

```python
import math

import torch
import torch.nn as nn


class CausalTransformer(nn.Module):
    def __init__(self, vocab_size, block_size, d_model, n_heads, n_layers, dropout):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # learned positional embeddings, one vector per position up to block_size
        self.pos_emb = nn.Parameter(torch.zeros(1, block_size, d_model))
        self.drop = nn.Dropout(dropout)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            dropout=dropout, activation='gelu', batch_first=True
        )
        self.enc = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.ln = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.block_size = block_size

    def forward(self, x, key_padding_mask=None):
        b, t = x.size()
        assert t <= self.block_size, "sequence is longer than block_size"
        tok = self.tok_emb(x) * math.sqrt(self.tok_emb.embedding_dim)
        pos = self.pos_emb[:, :t, :]
        h = self.drop(tok + pos)
        # causal mask: True above the diagonal means "may not attend".
        # Boolean so it has the same type as the boolean key_padding_mask.
        causal = torch.ones(t, t, device=x.device).triu(1).bool()
        h = self.enc(h, mask=causal, src_key_padding_mask=key_padding_mask)
        h = self.ln(h)
        return self.head(h)
```

2) Don’t learn “pad → pad”

You pad short chunks and then compute loss on every position, including pads. That trains the model to output <pad> repeatedly.
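To get a feel for how much of the training signal the pads take up, here is a toy sketch. The tensors are made up and `pad_id = 0` is an assumption, not your real tokenizer's value. It shows that the default mean cross-entropy averages over pad targets as well as real ones:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, pad_id = 10, 0  # toy values; pad_id = 0 is an assumption

# One short chunk: 3 real target tokens followed by 5 pads.
targets = torch.tensor([4, 7, 2, pad_id, pad_id, pad_id, pad_id, pad_id])
logits = torch.randn(targets.numel(), vocab_size)  # stand-in for model output

per_tok = F.cross_entropy(logits, targets, reduction="none")
is_pad = targets == pad_id

# Default reduction is the mean over every position, pads included.
default_loss = F.cross_entropy(logits, targets)
pad_share = is_pad.float().mean().item()

print(f"share of targets that are pad: {pad_share:.3f}")
print(f"default mean loss (with pads): {default_loss.item():.4f}")
print(f"mean loss on real tokens only: {per_tok[~is_pad].mean().item():.4f}")
```

With most targets being pad, the gradient is dominated by "predict <pad>", which is the behaviour you then see at generation time.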
Fix the loss:

```python
# falls back to 0 if '<pad>' is missing, so make sure id 0 really is your pad token
pad_id = tokenizer.token_to_id('<pad>') or 0
criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
```

And ideally pass a key padding mask so attention doesn’t attend to pads:

```python
# build mask: True where padding is present
key_padding_mask = (xb == pad_id)  # shape [B, T]
logits = model(xb, key_padding_mask=key_padding_mask)
loss = criterion(logits.view(-1, vocab_size), yb.view(-1))
```

3) Sampling tweaks reduce looping when the model is uncertain

During generation, add top-k/top-p and a mild repetition penalty:

```python
@torch.inference_mode()
def generate(model, tokenizer, prompt, max_new_tokens=100, temperature=0.9,
             top_k=50, top_p=0.95, rep_penalty=1.1, device='cpu'):
    model.eval().to(device)
    ids = torch.tensor([tokenizer.encode(prompt).ids], dtype=torch.long, device=device)
    for _ in range(max_new_tokens):
        # crop the context so it never exceeds the positional embedding table
        ctx = ids[:, -model.block_size:]
        logits = model(ctx)[:, -1, :] / max(1e-6, temperature)

        # repetition penalty: lower the logit of every token already generated.
        # Dividing a negative logit would raise it, so multiply in that case.
        for b in range(ids.size(0)):
            seen = ids[b].unique()
            vals = logits[b, seen]
            logits[b, seen] = torch.where(vals > 0, vals / rep_penalty, vals * rep_penalty)

        # top-k / top-p filtering
        probs = torch.softmax(logits, dim=-1)
        if top_k is not None:
            k = min(top_k, probs.size(-1))  # topk fails if k exceeds the vocab size
            kth = torch.topk(probs, k=k)[0][:, -1].unsqueeze(-1)
            probs = torch.where(probs < kth, torch.zeros_like(probs), probs)
        if top_p is not None:
            sorted_probs, sorted_idx = torch.sort(probs, descending=True)
            cumsum = torch.cumsum(sorted_probs, dim=-1)
            # drop a token only if the mass before it already exceeds top_p,
            # so the token that crosses the threshold (and the top token) is kept
            mask = (cumsum - sorted_probs) > top_p
            sorted_probs[mask] = 0.0
            probs = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
        probs = probs / probs.sum(dim=-1, keepdim=True)
        next_tok = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_tok], dim=1)
    return tokenizer.decode(ids[0].tolist())
```

Once you switch to a causal encoder stack (or a custom GPT block), mask pads in loss/attention, and add sane sampling, the “anarchism anarchism …” loops should disappear.
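If you want to confirm that a stack is actually causal, one option is a perturbation test: change only the last position and check whether any earlier output moves. The sketch below uses tiny randomly initialised layers and made-up sizes, and `causal_mask` and `leaks_future` are hypothetical helper names, so treat it as an illustration rather than a test of your exact model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads, T = 16, 2, 6  # toy sizes, chosen only for the demo


def causal_mask(t):
    # True above the diagonal: position i may not attend to any j > i
    return torch.ones(t, t).triu(1).bool()


# dropout=0.0 keeps the outputs deterministic without switching to eval mode
enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=32,
                                       dropout=0.0, batch_first=True)
enc = nn.TransformerEncoder(enc_layer, num_layers=2)

dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=32,
                                       dropout=0.0, batch_first=True)
dec = nn.TransformerDecoder(dec_layer, num_layers=2)


def leaks_future(run):
    """Return True if perturbing the last position changes any earlier output."""
    x = torch.randn(1, T, d_model)
    x2 = x.clone()
    x2[:, -1, :] += 1.0  # modify only the final position
    with torch.no_grad():
        a, b = run(x), run(x2)
    return not torch.allclose(a[:, :-1], b[:, :-1], atol=1e-5)


enc_leaks = leaks_future(lambda x: enc(x, mask=causal_mask(T)))
dec_leaks = leaks_future(lambda x: dec(tgt=x, memory=x, tgt_mask=causal_mask(T)))

print("encoder + causal mask leaks future tokens:", enc_leaks)
print("decoder with memory=x leaks future tokens:", dec_leaks)
```

The expectation is that the masked encoder reports no leak while the `memory=x` decoder call does, because its cross-attention has no mask. Running the same check on your own model, with token ids instead of random vectors, is a cheap guard against reintroducing the problem.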
python deep-learning pytorch
Question
Parent Entity
Score: 2 • Views: 99
Site: stackoverflow