A new study just dropped in the journal Science that might make you think twice about who you want diagnosing you in a medical crisis. Researchers from Harvard Medical School and Beth Israel Deaconess Medical Center teamed up to see how large AI models stack up against real doctors in the heat of an emergency room. They didn’t just use fake scenarios, either. They used real cases from the Beth Israel ER to put OpenAI’s technology to the test.
The team looked at 76 patients who came into the ER. They took the diagnoses from two human internal medicine doctors and compared them to what GPT-4 and GPT-4o came up with. To keep things fair, two other doctors reviewed the results without knowing which answers came from a human and which came from a machine. The results were eye-opening. The AI didn’t just keep up; it often performed better than the humans.
Winning the First Triage
The gap between man and machine was biggest right at the start of a patient’s visit. This phase, known as triage, is when doctors have the least information but have to make the fastest, most life-altering choices. Harvard’s press release explained that they gave the AI the same raw data available in the electronic medical records at the time of each diagnosis. They didn’t “pre-process” or clean up the data for the bot.
With that limited info, the AI gave the exact diagnosis or a very close one in 67% of triage cases. One of the human doctors managed 55%, and the other came in at 50%. Arjun Manrai, who runs an AI lab at Harvard Medical School, stated that the AI models beat every human baseline they tested against.
Why We Aren’t Firing Doctors Yet
While these numbers look great for the tech world, the researchers are quick to add some perspective. The study doesn’t mean AI is ready to take over the ER and start making life-or-death calls solo. Instead, it shows we need real-world trials to see how these tools can actually help in a hospital setting. There are also big gaps in what the AI can do. For example, these models only looked at text-based notes. They can’t yet “reason” through non-text clues, like looking at a patient’s physical state or hearing the tone of their voice.
Accountability is another huge hurdle. Adam Rodman, one of the lead authors and a doctor at Beth Israel, warned that we don’t have a legal framework to handle AI mistakes. He noted that patients still want a human being to walk them through the most difficult moments of their lives.
The Specialist Debate
Some doctors, like emergency physician Kristen Panthagani, think we should be careful with the hype. She pointed out that the study compared AI to internal medicine doctors, not doctors who specialize in emergency medicine. She argued that an ER doctor’s main goal isn’t just to get the final diagnosis right on the first try. Their job is to make sure you don’t have a condition that will kill you in the next few minutes. While an AI might pass a neurosurgery board exam, that doesn’t mean it knows how to handle the chaos of a busy emergency department.

