Generative AI models are rapidly entering clinical practice, promising to transform healthcare by reducing administrative burden and improving efficiency. But there’s a problem: once these models go live, we rarely know how well they continue to perform—or whether they remain safe, ethical, and trustworthy over time. That’s why the REAiHL Lab team from Erasmus MC presented a compelling challenge during the SAS Benelux HackSprint at the KNVB Campus in Zeist: how can we monitor and evaluate Large Language Models (LLMs) to ensure their safe, ethical, and effective use in clinical settings?
The team developed the Generative AI Control Center — a hospital-wide framework and dashboard that continuously monitors and validates the LLMs used in clinical settings, such as ambient AI scribes that capture and summarize doctor–patient conversations.
Four Dimensions of Evaluation
To ensure a holistic and responsible approach, the “Generative AI Control Center” dashboard evaluates LLMs across four key domains:
Building the Dashboard
Using SAS Viya technology, the prototype dashboard combines transcript data, AI-generated summaries, and survey feedback from clinicians. A key innovation is the “LLM-as-a-judge” module, which automatically assesses summary quality. By combining this with sentiment analysis, the dashboard computes both performance and ethical metrics.
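To make the idea concrete, here is a minimal Python sketch of how an "LLM-as-a-judge" module could score a summary and feed a dashboard record. This is illustrative only: the rubric, function names (`judge_summary`, `dashboard_row`), and the flagging threshold are assumptions for the example, not the team's actual SAS Viya implementation, and the judge model is passed in as a plain callable so any LLM client (or a test stub) can be plugged in.

```python
import json

# Rubric sent to the judge model; the criteria here are illustrative,
# not the REAiHL team's actual evaluation rubric.
RUBRIC = (
    "You are reviewing an AI-generated summary of a doctor-patient "
    "conversation. Rate the summary from 1 (poor) to 5 (excellent) on "
    "'faithfulness' (no hallucinated facts) and 'completeness' (no "
    "omitted clinical details). Reply with JSON only, e.g. "
    '{"faithfulness": 4, "completeness": 3}.'
)

def build_judge_prompt(transcript: str, summary: str) -> str:
    """Assemble the full prompt passed to the judge LLM."""
    return f"{RUBRIC}\n\nTranscript:\n{transcript}\n\nSummary:\n{summary}"

def judge_summary(transcript: str, summary: str, llm_call) -> dict:
    """Ask the judge model to score one summary.

    `llm_call` is any function mapping a prompt string to a response
    string, so the judge model stays swappable (and mockable in tests).
    """
    response = llm_call(build_judge_prompt(transcript, summary))
    scores = json.loads(response)
    # Clamp to the 1-5 scale in case the model drifts outside it.
    return {k: min(5, max(1, int(v))) for k, v in scores.items()}

def dashboard_row(scores: dict, clinician_sentiment: float) -> dict:
    """Combine judge scores with survey sentiment (-1..1) into one record."""
    performance = (scores["faithfulness"] + scores["completeness"]) / 2
    return {
        "performance": performance,        # 1..5 summary-quality score
        "sentiment": clinician_sentiment,  # from clinician survey feedback
        "flagged": performance < 3 or clinician_sentiment < 0,
    }
```

The key design choice in a setup like this is keeping the judge call injectable: the same pipeline can then be run against different judge models, or replayed offline against stored transcripts, when validating the monitoring itself.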
This intuitive, interactive dashboard gives evaluators and clinicians an at-a-glance overview of each model’s strengths, risks, and trade-offs—making AI evaluation not only transparent but actionable. The team is also exploring an explanatory LLM that can interpret and contextualize results for deeper insights.
Looking Ahead
Next steps include:
Long-term goals involve scaling the framework beyond AI scribes and establishing a maintenance team to oversee LLM implementations across hospital departments.
Why It Matters
Without robust monitoring, even promising AI tools may never reach the bedside. By uniting expertise in medical engineering, data science, and AI ethics, and drawing on experience building ICU dashboards with SAS software, the REAiHL Lab team excels at turning complex data into actionable insights for healthcare professionals. Their approach bridges the gap between innovation and implementation, ensuring that AI in healthcare is not only effective, but also ethical, sustainable, and trusted.
About the REAiHL Lab
The REAiHL Lab is a joint initiative of Erasmus MC, TU Delft, and SAS, united by a single mission: conduct research, design and implement AI systems, and translate the World Health Organization (WHO) ethical principles into clinically and technically feasible guidelines for the development and deployment of AI technologies in healthcare.
By combining technical depth with ethical awareness, the REAiHL Lab aims to make AI in healthcare transparent, trustworthy, and truly usable at the bedside.