Question Details

No question body available.

Tags

python large-language-model

Answers (2)

March 13, 2026 Score: 0 Rep: 1 Quality: Low Completeness: 80%

For medical domain LLMs where factuality is critical, here are the key GenerationConfig parameters I'd recommend based on practical experience:

Temperature: 0.1-0.3 (not 0)

Setting temperature to exactly 0 (greedy decoding) can cause repetitive loops and degenerate output, especially on longer sequences. A small value like 0.1-0.2 gives you near-deterministic output while maintaining coherence. For clinical note summarization specifically, I'd start with 0.1.

```python
generation_config = {
    "temperature": 0.1,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.15,
    "max_new_tokens": 1024,
    "do_sample": True,
}
```
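To see concretely why a small temperature behaves almost like greedy decoding without being fully deterministic, here is a minimal sketch of temperature scaling (pure Python, standard library only):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature before softmax; lower T sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))  # top token gets ~0.63
print(softmax_with_temperature(logits, 0.1))  # top token gets >0.999
```

At T=0.1 the top token absorbs nearly all probability mass, so sampling is near-deterministic, but ties and near-ties can still break either way, which is what avoids the degenerate loops of exact greedy decoding.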

Top-p (nucleus sampling): 0.85-0.95

Keep this relatively tight. Lower top_p restricts the token pool to high-probability tokens, reducing hallucination risk. For extraction tasks (oncology reports), use 0.85. For summarization where you need slightly more flexibility, 0.9-0.95 works well.

Repetition penalty: 1.1-1.2

This is critical for medical text where certain terms naturally repeat (drug names, anatomical terms). Too high (>1.3) and the model will actively avoid repeating medical terminology it should be repeating. 1.15 is a good sweet spot.
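To make the trade-off concrete, this is roughly what repetition_penalty does under the hood (CTRL-style, as used by the Hugging Face implementation): positive logits of already-generated tokens are divided by the penalty, negative ones multiplied, so every repeat becomes a bit less likely. A minimal sketch:

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """CTRL-style repetition penalty: push logits of already-generated
    tokens toward being less likely on the next step."""
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty   # shrink positive logits
        else:
            out[token_id] *= penalty   # make negative logits more negative
    return out

# Tokens 0 and 1 were already generated; token 2 is untouched.
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], 1.15))
```

You can see from the division why a large penalty is dangerous for medical text: a drug name the model has already emitted once gets its logit suppressed every time it should legitimately reappear.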

Practical strategies beyond GenerationConfig:

  1. Constrained decoding: For extraction tasks, consider structured output (JSON mode) to force the model into a schema that maps to your clinical fields.

  2. Self-consistency sampling: Generate N responses (e.g., N=5) with temperature=0.3, then take the majority answer. This significantly reduces hallucination on factual extraction tasks.

  3. RAG with medical knowledge bases: Pair your LLM with retrieval from PubMed/UMLS rather than relying on parametric knowledge alone.
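Of these, self-consistency is the easiest to wire up. A minimal sketch of the majority-vote loop, where `generate_fn` stands in for your actual model call (hypothetical; a real one would sample at temperature ~0.3 and parse out the extracted field):

```python
from collections import Counter

def self_consistency(generate_fn, prompt, n=5):
    """Sample n answers from the model and return the majority answer."""
    answers = [generate_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in for a real LLM call (hypothetical canned outputs, for illustration).
canned = iter(["stage II", "stage III", "stage II", "stage II", "stage IV"])
fake_generate = lambda prompt: next(canned)

print(self_consistency(fake_generate, "Extract the tumor stage: ...", n=5))  # stage II
```

Note this only works cleanly when answers are short and comparable (labels, codes, extracted values); for free-text summaries you would need to vote at the field level after structuring the output.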

Relevant papers:

  • "Calibrated Language Models for Clinical NLP" discusses temperature scaling for clinical domains
  • The original nucleus sampling paper (Holtzman et al., 2019) explains why top-p outperforms top-k for maintaining output quality
March 13, 2026 Score: 0 Rep: 1,137 Quality: Low Completeness: 0%

If factual accuracy and consistency are far more critical than linguistic creativity, then never use an LLM.