
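Comparative study of different large language models and medical professionals of different levels responding to ophthalmology questions
Cite this article: Huang Hui, Hu Jinyu, Wang Xiaoyu, Ye Shuyuan, Wu Shinan, Chen Cheng, He Liangqi, Zeng Yanmei, Wei Hong, Shao Yi. Comparative study of different large language models and medical professionals of different levels responding to ophthalmology questions[J]. 国际眼科杂志 (International Eye Science), 2024, 24(3): 458-462
Authors: Huang Hui  Hu Jinyu  Wang Xiaoyu  Ye Shuyuan  Wu Shinan  Chen Cheng  He Liangqi  Zeng Yanmei  Wei Hong  Shao Yi
Affiliations: Department of Ophthalmology, the First Affiliated Hospital of Nanchang University, Nanchang, Jiangxi Province, China; Eye & ENT Hospital of Fudan University, Shanghai, China; Eye Institute of Xiamen University, Xiamen, Fujian Province, China
Abstract: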

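Objective: To evaluate the performance of three large language models (LLM; GPT-3.5, GPT-4, and PaLM2) in answering specialist ophthalmology questions, and to compare them with three groups of medical professionals at different levels (medical undergraduates, master's students in medicine, and attending physicians).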

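Methods: The three LLMs and the three groups of professionals (9 undergraduates, 6 professional-degree postgraduates, and 3 attending physicians) each took a test of 100 single-choice ophthalmology questions covering basic ophthalmic knowledge, clinical knowledge, ophthalmic examination and diagnostic methods, and treatments for eye diseases. LLM performance was evaluated on mean score, answer stability, and answer confidence, and compared with the human groups.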

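Results: On mean test score, every LLM outperformed the undergraduates overall (GPT-4: 56; GPT-3.5: 42; PaLM2: 47; undergraduates: 40). GPT-3.5 and PaLM2 scored slightly below the master's level (master's students: 51), while GPT-4 performed at a level comparable to attending physicians (attending physicians: 62). In addition, GPT-4 showed markedly higher answer stability and answer confidence than GPT-3.5 and PaLM2.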

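Conclusion: LLMs, as represented by GPT-4, perform fairly well in ophthalmology and can support clinicians in clinical decision-making and serve as teaching aids in medical education.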

Keywords: large language models (LLM)   natural language processing   ophthalmology questions
Received: 2023-11-02
Revised: 2024-01-26

Comparative study of different large language models and medical professionals of different levels responding to ophthalmology questions
Huang Hui,Hu Jinyu,Wang Xiaoyu,Ye Shuyuan,Wu Shinan,Chen Cheng,He Liangqi,Zeng Yanmei,Wei Hong,Shao Yi. Comparative study of different large language models and medical professionals of different levels responding to ophthalmology questions[J]. International Eye Science, 2024, 24(3): 458-462
Authors:Huang Hui  Hu Jinyu  Wang Xiaoyu  Ye Shuyuan  Wu Shinan  Chen Cheng  He Liangqi  Zeng Yanmei  Wei Hong  Shao Yi
Affiliation: Department of Ophthalmology, the First Affiliated Hospital of Nanchang University, Nanchang 330006, Jiangxi Province, China; Eye & ENT Hospital of Fudan University, Shanghai 200126, China; Eye Institute of Xiamen University, Xiamen 361104, Fujian Province, China. *Co-first authors: Huang Hui and Hu Jinyu
Abstract: AIM: To evaluate the performance of three large language models (LLM), namely GPT-3.5, GPT-4, and PaLM2, in answering ophthalmology questions, and to compare their performance with that of medical professionals at three levels: medical undergraduates, master's students in medicine, and attending physicians.

METHODS: A test of 100 single-answer multiple-choice ophthalmology questions, covering basic ophthalmic knowledge, clinical knowledge, ophthalmic examination and diagnostic methods, and treatments for ocular disease, was administered to the three LLMs and to the three groups of medical professionals (9 undergraduates, 6 postgraduates, and 3 attending physicians). LLM performance was evaluated in terms of mean score, response consistency, and response confidence, and compared with that of the human groups.
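The abstract does not say how mean score and response consistency were computed. As a minimal sketch, assume each model is asked the same questions several times, one point is awarded per correct answer, and consistency is the fraction of questions on which every run gives the same option. The function names and the toy data below are hypothetical and are not taken from the study.

```python
def score(answers, key):
    """Count correct answers, one point per question."""
    return sum(a == k for a, k in zip(answers, key))


def mean_score(runs, key):
    """Average score over repeated runs of the same test."""
    return sum(score(r, key) for r in runs) / len(runs)


def consistency(runs):
    """Fraction of questions answered identically in every run (assumed definition)."""
    n_questions = len(runs[0])
    same = sum(len({r[i] for r in runs}) == 1 for i in range(n_questions))
    return same / n_questions


# Toy example (hypothetical): 5 questions, 3 repeated runs of one model.
key = ["A", "C", "B", "D", "A"]
runs = [
    ["A", "C", "B", "A", "A"],
    ["A", "C", "D", "A", "A"],
    ["A", "B", "B", "A", "A"],
]

print(mean_score(runs, key))  # mean number of correct answers per run
print(consistency(runs))      # share of questions with identical answers
```

Note that under this definition a model can be consistent and wrong at once: the fourth toy question is answered "A" in every run even though the key is "D", so consistency and accuracy should be read separately.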

RESULTS: Each LLM surpassed the mean score of the medical undergraduates (GPT-4: 56, GPT-3.5: 42, PaLM2: 47, undergraduates: 40). GPT-3.5 and PaLM2 scored slightly lower than the master's students (51), while GPT-4 performed at a level comparable to that of the attending physicians (62). Furthermore, GPT-4 showed markedly higher response consistency and confidence than GPT-3.5 and PaLM2.
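The six reported means can be placed side by side to make the ordering explicit. The short sketch below uses only the figures quoted in the results; the variable names are illustrative.

```python
# Mean scores (out of 100) as reported in the abstract.
reported_means = {
    "GPT-4": 56,
    "GPT-3.5": 42,
    "PaLM2": 47,
    "undergraduates": 40,
    "master's students": 51,
    "attending physicians": 62,
}

# Rank all groups from highest to lowest mean score.
ranking = sorted(reported_means.items(), key=lambda kv: kv[1], reverse=True)

# Point gap between each group and the attending physicians.
top = reported_means["attending physicians"]
gap_to_attending = {name: top - mean for name, mean in reported_means.items()}

for name, mean in ranking:
    print(f"{name}: {mean} (gap to attending physicians: {gap_to_attending[name]})")
```

On these figures GPT-4 sits between the master's students and the attending physicians, 6 points below the latter and 5 points above the former.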

CONCLUSION: LLMs, as represented by GPT-4, perform well in the field of ophthalmology and can support clinicians in clinical decision-making and serve as teaching aids in medical education.

Keywords: large language models (LLM)   natural language processing   ophthalmology questions