← Back to Research Radar
Academic Publication Academic Publication

Testing and Evaluation of Health Care Applications of Large Language Models

406
Citations
January 28, 2025
Published Date

Research Abstract & Technology Focus

ImportanceLarge language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas.ObjectiveTo summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty.Data SourcesA systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024.Study SelectionStudies evaluating 1 or more LLMs in health care.Data Extraction and SynthesisThree independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty.ResultsOf 519 studies reviewed, published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge such as answering medical licensing examination questions (44.5%) and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), while tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) being the least represented.Conclusions and RelevanceExisting evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity and deployment considerations received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden focus to include a wider range of tasks and specialties.
Read Full Literature

Correlated Market Trend: A/b Testing

Bridging academia to market: The 60-day public search velocity mapping directly to the core technology of this paper. Dashed line represents 7-day moving average.

AI Semantic Synergy Context

Connecting this academic literature to real-world market discussions and products.

crossref.org › academic paper
31%
🔥

Large Language Models in Healthcare and Medical Domain: A Review

The deployment of large language models (LLMs) within the healthcare sector has sparked both enthusiasm and apprehension. These models exhibit the remarkable ability to provide proficient responses...

crossref.org › academic paper
16%

Testing and Evaluation of Health Care Applications of Large Language Models

ImportanceLarge language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas.ObjectiveTo summ...

crossref.org › academic paper
0%

Evaluation and mitigation of the limitations of large language models in clinical decision-making

Abstract Clinical decision-making is one of the most impactful parts of a physician’s responsibilities and stands to benefit greatly from artificial intelligence solutions and lar...

crossref.org › academic paper
0%

A Survey on Evaluation of Large Language Models

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role...

crossref.org › academic paper
0%

Large Language Model Influence on Diagnostic Reasoning

ImportanceLarge language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such ...

Frequently Asked Questions (FAQ)

Curated market intelligence mapped to this research.

What is the core focus of the research titled 'Testing and Evaluation of Health Care Applications of Large Language Models'?

This literature focuses on: ImportanceLarge language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas.ObjectiveTo summarize existing evaluations of LLMs in health care ...

Are there open-source GitHub repositories related to Testing and Evaluation of Health Care Applications of Large Language Models?

Yes, open-source projects like sooryathejas/METATRON (AI-powered penetration testing assistant using local LLM on linux (Parrot OS)) are actively building upon these concepts.

Which startups are commercializing the technology behind Testing and Evaluation of Health Care Applications of Large Language Models?

Products like FoodHealth Score are bringing this to market. Their focus is: Find healthier groceries while you shop online.

What other academic literature is closely related to 'Testing and Evaluation of Health Care Applications of Large Language Models'?

Yes, highly correlated activity was mapped. An entry titled 'Large Language Models in Healthcare and Medical Domain: A Review' discusses this: The deployment of large language models (LLMs) within the healthcare sector has sparked both enthusiasm and apprehension. These models exhibit the ...

Cite this Market Intelligence Report

Reference our AI-mapped synergy between this research and the commercial market to instantly build authority.

Commercial Realization

Startups and Open Source tools heavily associated with the concepts explored in this paper.

Associated Media Narrative