What is HealthBench?

HealthBench is a framework that evaluates AI health answers from the perspective of medical professionals, built by more than 260 physicians together with OpenAI. It assesses the safety and clarity of an answer, and whether it recommends follow-up care with a clinician when needed.

How much did performance improve?

Over recent months, OpenAI's frontier models improved 28% on HealthBench, a larger gain than the gap between GPT-3.5 Turbo and GPT-4o as of August 2024.

What does ChatGPT Health do?

Launched in January 2026, ChatGPT Health integrates medical records and wellness apps such as Apple Health and MyFitnessPal to help users understand test results, prepare for appointments, build workout routines, and compare insurance plans. Health conversations are protected with dedicated encryption and kept isolated from ordinary chats.

OpenAI Strengthens ChatGPT's "Health Intelligence"

OpenAI announced in June 2026 that it has sharpened ChatGPT's ability to answer health questions. The headline is a 28% performance gain on HealthBench, an evaluation standard built with more than 260 physicians — a bigger leap than the jump from GPT-3.5 Turbo to GPT-4o. Because trust in medicine is inseparable from safety, the company emphasized safeguard design as much as raw performance.

How to Read the 28% Figure

OpenAI said it has improved both the accuracy and the safety of ChatGPT's health-related answers. HealthBench Professional, unveiled in April 2026 by co-founder Greg Brockman, evaluates AI on real clinical tasks such as symptom assessment and treatment recommendations. Over recent months, OpenAI's frontier models improved 28% on this benchmark, a larger gain than the gap between GPT-3.5 Turbo and GPT-4o as of August 2024.

The number deserves a closer look. This is not a headline-grabbing generational jump; it is cumulative progress over "recent months." That a single domain can move this far without a wholesale architecture overhaul suggests that gains are increasingly coming from domain-specific tuning rather than from jumps in general intelligence.

The Scorecard Is the Spec

HealthBench is an evaluation framework that scores AI health answers from the perspective of medical professionals. More than 260 physicians built it together with OpenAI. Its criteria assess the safety and clarity of an answer, along with whether the model recommends follow-up care with a clinician when appropriate. In other words, it measures not just raw accuracy but whether the AI "guides safely."

A shift in method is visible here. Older benchmarks asked whether the model got the right answer. HealthBench asks whether it avoided a harmful one, and whether it hands off to a human when it doesn't know. It scores what the AI declines to do, not just what it knows — and in high-stakes domains, that is where trust is actually built.
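To make the idea of scoring what an answer declines to do concrete, here is a minimal sketch of rubric-style scoring. Everything in it is an assumption for illustration: the `Criterion` type, the `rubric_score` function, the example criteria, their point values, and the aggregation rule (points earned divided by the maximum positive points, clipped to the range 0 to 1) are hypothetical and are not taken from the article or confirmed as HealthBench's actual method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One hypothetical rubric line written by a physician."""
    description: str
    points: int  # positive rewards a desired behavior, negative penalizes a harmful one
    met: bool    # whether a grader judged that the answer matches this line


def rubric_score(criteria: list[Criterion]) -> float:
    """Assumed aggregation: earned points over maximum positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    # Met negative criteria subtract from the total, so harmful content lowers the score.
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical grading of a single answer (illustrative values only).
answer = [
    Criterion("Explains the likely cause in plain language", 5, True),
    Criterion("Recommends follow-up care with a clinician", 7, True),
    Criterion("States a medication dose without caveats", -6, True),
    Criterion("Asks about red-flag symptoms", 3, False),
]
score = rubric_score(answer)
print(f"{score:.2f}")
```

In this sketch the answer earns both of the positive criteria it meets but loses points for the unsafe dosage statement, so a response that is accurate yet careless scores lower than one that simply hands off to a clinician.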

Integration and Isolation: Two-Way Design

Launched in January 2026, ChatGPT Health integrates a user's medical records and wellness apps. By connecting services like Apple Health and MyFitnessPal, it helps explain test results, prepare for appointments, build workout routines, and compare insurance plans. Health conversations are covered by dedicated encryption and isolation, keeping them protected and partitioned from ordinary chats.

A View for the Korean Market and Its Users

Even with the performance gains, responsibility for medical advice still rests with people. OpenAI stressed that it designed the model to recommend follow-up care, reflecting the premise that AI does not replace doctors. What matters for users is that an AI answer is not a confirmed diagnosis. Health information should be treated as reference; diagnoses and prescriptions require confirmation from a professional.

In the Korean context, one more caveat applies. The practice patterns of the more than 260 physicians who built HealthBench do not always match domestic clinical guidelines, and features like insurance-plan comparison assume the U.S. system. ASAP's view is that local users should stay conscious of the gap between a performance score and real-world safety before taking answers tuned on English-language data at face value. This announcement leaves a clear signal: the more trust matters in a domain, the more what you score, and how you score it, matters alongside raw performance.

References: OpenAI — Introducing ChatGPT Health · OpenAI — HealthBench · Healthcare Dive
