ChatGPT radiologist? Researchers test AI model in exam, and it did quite well

With some very "illogical and inaccurate assertions," it is not ready to replace radiologists yet.
Ameya Paleja
ChatGPT has performed well in radiology exams. Will it examine you?


Researchers at the Toronto General Hospital in Canada did what most people are doing these days: they had ChatGPT answer questions from a standard exam to see how it fares. The conversational chatbot scored 81 percent on a 150-question test designed to mimic exams conducted by radiology boards in Canada and the U.S., well above the passing score of 70 percent.

Since the launch of ChatGPT, users have been awestruck by its ability to comprehend information and use it to answer queries. It has been put to the test on questions from the U.S. Medical Licensing Exam (USMLE) as well as an MBA exam at the Wharton Business School, where it gave some mediocre performances.

With the use of ChatGPT increasing across sectors, researchers at University Medical Imaging Toronto decided that it was time to test the abilities of the chatbot in radiology as well.

ChatGPT answers radiology questions

The researchers set up a 150-question test for ChatGPT, much like the ones that radiology boards in Canada and the U.S. set for students. Since the AI bot cannot process images as input, the researchers provided text-only questions, which were grouped into lower-order and higher-order questions.

Questions in the lower-order group tested the chatbot on knowledge recall and basic understanding of the subject, while those in the higher-order group required it to apply, analyze, and synthesize information.

Could AI carry out examinations soon? Researchers are wary

Since there are two versions of GPT currently available, the researchers tested both of them on the same question set to see if one was better than the other.

ChatGPT powered by the older version, GPT-3.5, scored only 69 percent on the question set. It did well on the lower-order questions (84 percent, 51 correct out of 61) but struggled with the higher-order questions, managing only 60 percent (53 out of 89).

After GPT-4 was released in March 2023, the researchers tested the improved version of ChatGPT on the question set, and it scored 81 percent, getting 121 of the 150 questions correct. In line with OpenAI's claims about GPT-4's superior reasoning capabilities, the newly launched large language model scored 81 percent on the higher-order questions, the press release said.
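The percentages reported for both models can be checked against the raw counts quoted in the article. The short Python sketch below does that arithmetic; the helper name `percent` and the variable names are illustrative only and do not come from the study, and rounding to the nearest whole percent is an assumption about how the figures were reported.

```python
# Check the reported scores against the raw counts given in the article.
# Assumption: figures were rounded to the nearest whole percent.

PASS_MARK = 70  # passing score, in percent, as stated in the article


def percent(correct, total):
    """Return the share of correct answers as a whole-number percentage."""
    return round(100 * correct / total)


# GPT-3.5: 51 of 61 lower-order and 53 of 89 higher-order questions correct
gpt35_lower = percent(51, 61)
gpt35_higher = percent(53, 89)
gpt35_overall = percent(51 + 53, 61 + 89)

# GPT-4: 121 of 150 questions correct overall
gpt4_overall = percent(121, 150)

print("GPT-3.5 lower-order:", gpt35_lower)
print("GPT-3.5 higher-order:", gpt35_higher)
print("GPT-3.5 overall:", gpt35_overall, "pass:", gpt35_overall >= PASS_MARK)
print("GPT-4 overall:", gpt4_overall, "pass:", gpt4_overall >= PASS_MARK)
```

Under that rounding assumption, the counts reproduce the 84, 60, 69, and 81 percent figures, with GPT-3.5 falling just short of the pass mark and GPT-4 clearing it.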

What stumped the researchers, though, was the performance of GPT-4 on lower-order questions, where it got 12 questions wrong that GPT-3.5 had answered correctly. "We were initially surprised by ChatGPT's accurate and confident answers to some challenging radiology questions, but then equally surprised by some very illogical and inaccurate assertions," said Rajesh Bhayana, a radiologist and technology lead at Toronto General Hospital.

While the tendency to confidently deliver incorrect information, dubbed hallucination, has been reduced in GPT-4, it has not been eliminated. In medical practice, this can be dangerous, especially when the chatbot is used by novices who may not recognize inaccurate replies, the researchers added.