Answer to: Recommended GenerationConfig for Medical Domain LLMs: Strategies to Minimize Hallucination and Ensure Factuality

Score: 0
Answered: Mar 13, 2026
User Rep: 1
For medical-domain LLMs where factuality is critical, these are the key GenerationConfig parameters I'd recommend based on practical experience:

Temperature: 0.1-0.3 (not 0)

Setting temperature to exactly 0 (greedy decoding) can cause repetitive loops and degenerate output, especially on longer sequences. A small value like 0.1-0.2 gives you near-deterministic output while maintaining coherence. For clinical note summarization specifically, I'd start with 0.1.

```python
generation_config = {
    "temperature": 0.1,          # near-deterministic, but not greedy
    "top_p": 0.9,                # nucleus cutoff, see below
    "top_k": 40,                 # hard cap on candidate tokens
    "repetition_penalty": 1.15,  # mild, so medical terms can still repeat
    "max_new_tokens": 1024,
    "do_sample": True,           # sampling must be on for temperature/top_p/top_k to apply
}
```

Top-p (nucleus sampling): 0.85-0.95

Keep this relatively tight. A lower top_p restricts sampling to the smallest set of high-probability tokens whose cumulative probability reaches the cutoff, which reduces hallucination risk. For extraction tasks (e.g., oncology reports), use 0.85. For summarization, where you need slightly more flexibility, 0.9-0.95 works well.

Repetition penalty: 1.1-1.2

Getting this right matters for medical text, where certain terms naturally repeat (drug names, anatomical terms). Set it too high (>1.3) and the model will actively avoid repeating medical terminology it should be repeating. 1.15 is a good sweet spot.

Practical strategies beyond GenerationConfig:

- Constrained decoding: For extraction tasks, consider structured output (JSON mode) to force the model into a schema that maps to your clinical fields.
- Self-consistency sampling: Generate N responses (e.g., N=5) with temperature=0.3, then take the majority answer. This can significantly reduce hallucination on factual extraction tasks.
- RAG with medical knowledge bases: Pair your LLM with retrieval from PubMed/UMLS rather than relying on parametric knowledge alone.

Relevant papers:

- "Calibrated Language Models for Clinical NLP" discusses temperature scaling for clinical domains.
- The original nucleus sampling paper (Holtzman et al., 2019) explains why top_p outperforms top_k for maintaining output quality.
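To make the temperature and top_p advice above more concrete, here is a minimal pure-Python sketch of what the two settings do to a next-token distribution. The four logits are made-up numbers, `apply_temperature` and `top_p_filter` are illustrative helper names rather than library functions, and the sketch assumes temperature is applied before the top_p cutoff.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature (must be > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p, then renormalize. Returns {token_index: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Hypothetical scores for four candidate tokens (not from a real model).
toy_logits = [4.0, 3.0, 2.0, 0.0]

for t in (1.0, 0.3, 0.1):
    probs = apply_temperature(toy_logits, t)
    kept = top_p_filter(probs, top_p=0.9)
    print(f"temperature={t}: probs={[round(p, 3) for p in probs]}, "
          f"tokens kept by top_p=0.9: {sorted(kept)}")
```

In this toy case, lowering the temperature concentrates almost all probability on the top token, so the top_p cutoff ends up keeping fewer candidates. Real vocabularies are far larger, so treat this as an intuition aid rather than a prediction of model behavior.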
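The self-consistency strategy mentioned above can be sketched roughly as follows. This is only an outline under stated assumptions: `generate` stands for whatever function calls your model with sampling enabled, `fake_generate` is a canned stand-in so the example runs without a model, and the `min_agreement` threshold is my own addition for flagging low-agreement cases, not part of the basic technique.

```python
from collections import Counter

def normalize(answer):
    """Collapse case and whitespace so trivially different strings share a vote."""
    return " ".join(answer.lower().split())

def self_consistent_answer(generate, prompt, n=5, min_agreement=0.6):
    """Call generate(prompt) n times and return (majority_answer, agreement).

    The answer is None when the most common output's share of the votes is
    below min_agreement, so the caller can send the case to human review.
    """
    votes = Counter(normalize(generate(prompt)) for _ in range(n))
    answer, count = votes.most_common(1)[0]
    agreement = count / n
    return (answer if agreement >= min_agreement else None), agreement

# Hypothetical stand-in for a sampled LLM call: replays canned outputs.
canned = iter(["Stage IIB", "stage iib", "Stage IIIA", "Stage  IIB", "stage IIB"])

def fake_generate(prompt):
    return next(canned)

answer, agreement = self_consistent_answer(fake_generate, "Extract the tumor stage.")
print(answer, agreement)
```

Exact-match voting like this only works when the outputs are short and canonical (a stage, a dosage, a yes/no), which is another reason to combine it with structured output for extraction tasks.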
python large-language-model