Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank

Abstract

Background: General large language models (LLMs), such as ChatGPT (GPT-3.5), have demonstrated the capability to pass multiple-choice medical board examinations. However, the comparative accuracy of different LLMs and LLM performance on assessments of predominantly higher-order management questions are poorly understood.

Objective: To assess the performance of three LLMs (GPT-3.5, GPT-4, and Google Bard) on a question bank designed specifically for neurosurgery oral boards examination preparation.

Methods: The 149-question Self-Assessment Neurosurgery Exam (SANS) Indications Exam was used to query LLM accuracy. Questions were input in a single best answer, multiple-choice format. Chi-squared, Fisher’s exact, and univariable logistic regression tests assessed differences in performance by question characteristics.

Results: On a question bank with predominantly higher-order questions (85.2%), ChatGPT (GPT-3.5) and GPT-4 answered 62.4% (95% confidence interval [CI]: 54.1-70.1%) and 82.6% (95% CI: 75.2-88.1%) of questions correctly, respectively. In contrast, Bard scored 44.2% (66/149, 95% CI: 36.2-52.6%). GPT-3.5 and GPT-4 demonstrated significantly higher scores than Bard (both P < 0.01), and GPT-4 significantly outperformed GPT-3.5 (P = 0.023). Among six subspecialties, GPT-4 had significantly higher accuracy in the Spine category relative to GPT-3.5 and in four categories relative to Bard (all P < 0.01). Incorporation of higher-order problem solving was associated with lower question accuracy for GPT-3.5 (OR = 0.80, P = 0.042) and Bard (OR = 0.76, P = 0.014), but not GPT-4 (OR = 0.86, P = 0.085). GPT-4’s performance on imaging-related questions surpassed GPT-3.5’s (68.6% vs. 47.1%, P = 0.044) and was comparable to Bard’s (68.6% vs. 66.7%, P = 1.000). However, GPT-4 demonstrated significantly lower rates of “hallucination” on imaging-related questions than both GPT-3.5 (2.3% vs. 57.1%, P < 0.001) and Bard (2.3% vs. 27.3%, P = 0.002). Lack of question text description for imaging predicted significantly higher odds of hallucination for GPT-3.5 (OR = 1.45, P = 0.012) and Bard (OR = 2.09, P < 0.001).

Conclusion: On a question bank of predominantly higher-order management case scenarios intended for neurosurgery oral boards preparation, GPT-4 achieved a score of 82.6%, outperforming ChatGPT and Google’s Bard.
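
As an illustration of how a 95% confidence interval for an accuracy score such as Bard's 66/149 can be obtained, the following is a minimal sketch using the Wilson score interval. The abstract does not state which interval method the authors used, so the Wilson method is an assumption made here for illustration, and its bounds may differ slightly from the reported ones. The function name `wilson_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, total: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: the abstract does not name its CI method; Wilson is
    used here only as a common, stdlib-friendly choice.
    """
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / total

    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Bard's count of correct answers as given in the abstract: 66 of 149.
low, high = wilson_interval(66, 149)
print(f"95% Wilson interval: {low:.1%} to {high:.1%}")
```

An exact (Clopper-Pearson) interval would be somewhat wider than the Wilson interval for the same counts, which is one reason the two methods should not be expected to agree to the decimal.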
