Concepedia

Publication | Open Access

Evaluation of GPT Large Language Model Performance on RSNA 2023 Case of the Day Questions

16

Citations

10

References

2024

Year

Abstract

Background GPT-4V (GPT-4 with vision, ChatGPT; OpenAI) has shown impressive performance in several medical assessments. However, few studies have assessed its performance in interpreting radiologic images. Purpose To assess and compare the accuracy of GPT-4V in assessing radiologic cases with both images and textual context to that of radiologists and residents, to assess if GPT-4V assistance improves human accuracy, and to assess and compare the accuracy of GPT-4V with that of image-only or text-only inputs. Materials and Methods Seventy-two Case of the Day questions at the RSNA 2023 Annual Meeting were curated in this observer study. Answers from GPT-4V were obtained between November 26 and December 10, 2023, with the following inputs for each question: image only, text only, and both text and images. Five radiologists and three residents also answered the questions in an "open book" setting. For the artificial intelligence (AI)-assisted portion, the radiologists and residents were provided with the outputs of GPT-4V. The accuracy of radiologists and residents, both with and without AI assistance, was analyzed using a mixed-effects linear model. The accuracies of GPT-4V with different input combinations were compared by using the McNemar test. <i>P</i> < .05 was considered to indicate a significant difference. Results The accuracy of GPT-4V was 43% (31 of 72; 95% CI: 32, 55). Radiologists and residents did not significantly outperform GPT-4V in either imaging-dependent (59% and 56% vs 39%; <i>P</i> = .31 and .52, respectively) or imaging-independent (76% and 63% vs 70%; both <i>P</i> = .99) cases. With access to GPT-4V responses, there was no evidence of improvement in the average accuracy of the readers. The accuracy obtained by GPT-4V with text-only and image-only inputs was 50% (35 of 70; 95% CI: 39, 61) and 38% (26 of 69; 95% CI: 27, 49), respectively. Conclusion The radiologists and residents did not significantly outperform GPT-4V. Assistance from GPT-4V did not help human raters. GPT-4V relied on the textual context for its outputs. © RSNA, 2024 <i>Supplemental material is available for this article.</i> See also the editorial by Katz in this issue.

References

YearCitations

Page 1