2Ophthalmology Department, Mustafa Kemal University, Hatay, Türkiye
3Ophthalmology Department, Fırat University, Elazığ, Türkiye

DOI: 10.37844/TJ-CEO.2024.19.28

Purpose: Large language models could theoretically be used for education and training in glaucoma. The aim of this study was to determine the proficiency of chatbots in the field of glaucoma, and the differences between them, using self-assessment questions.
Materials and Methods: Self-assessment questions from the last decade were obtained from the American Academy of Ophthalmology Basic and Clinical Science Course Glaucoma Section books. These questions were posed one by one to ChatGPT-3.5, ChatGPT-4.0, Bing, and Bard. The answers, recorded as correct or incorrect, were analyzed to evaluate the performance of the artificial intelligence chatbots. Questions were evaluated in six main categories. In addition to descriptive statistical methods, Fisher's exact test and Pearson's chi-square test were used to compare the chatbots both pairwise and as a group.
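For illustration, a minimal Python sketch of the analysis described above, using SciPy's implementations of Pearson's chi-square test (for the overall comparison) and Fisher's exact test (for the pairwise comparisons). The per-chatbot correct/incorrect counts below are hypothetical placeholders consistent with the reported percentages (a total of 121 questions is assumed), not the study's actual data.

```python
# Illustrative sketch only: the abstract reports percentages, not raw counts,
# so the numbers below are hypothetical placeholders (121 questions assumed).
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical (correct, incorrect) counts per chatbot.
counts = {
    "ChatGPT-4.0": (103, 18),  # ~85.1%
    "Bing":        (99, 22),   # ~81.8%
    "Bard":        (82, 39),   # ~67.8%
    "ChatGPT-3.5": (78, 43),   # ~64.5%
}

# Overall comparison of all four chatbots: Pearson's chi-square on a 4x2 table.
table = np.array(list(counts.values()))
chi2, p_all, dof, _ = chi2_contingency(table)
print(f"All groups: chi2={chi2:.2f}, df={dof}, p={p_all:.4f}")

# Pairwise comparisons: Fisher's exact test on each 2x2 sub-table.
names = list(counts)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        _, p = fisher_exact([counts[names[i]], counts[names[j]]])
        print(f"{names[i]} vs {names[j]}: p={p:.4f}")
```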
Results: ChatGPT-4.0 had the highest correct response rate, at 85.10%; Bing followed with a good accuracy rate of 81.80%. Bard and ChatGPT-3.5 underperformed, at 67.80% and 64.50%, respectively. The difference across all four chatbots was statistically significant (p<0.05). In pairwise comparisons, ChatGPT-4.0 differed significantly from both Bard and ChatGPT-3.5, as did Bing (p<0.05). No significant difference was observed between ChatGPT-4.0 and Bing, or between Bard and ChatGPT-3.5 (p>0.05).
Conclusion: ChatGPT-4.0 and Bing achieved impressive correct response rates, while ChatGPT-3.5 and Bard proved inadequate. ChatGPT-4.0 and Bing have the potential to be used in education and training, provided care is taken to avoid misinformation, inaccurate results, and bias. Bard has a low correct response rate but is open to improvement.
Keywords: Large language models, glaucoma, ChatGPT, Bing, Bard