
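Study on the Construction of a Question-Answer Corpus Dataset for Chinese Medical Knowledge Large Language Models
Cite this article: LYU Tingyu, LI Xiaoying, ZHANG Ying, LIU Yuyang, DU Jinhua, LI Xinyi, LUO Yan, TANG Xiaoli, REN Huiling, LIU Hui, YIN Hao. Study on the Construction of a Question-Answer Corpus Dataset for Chinese Medical Knowledge Large Language Models[J]. Journal of Medical Informatics, 2024, 45(5): 20-25
Authors: LYU Tingyu  LI Xiaoying  ZHANG Ying  LIU Yuyang  DU Jinhua  LI Xinyi  LUO Yan  TANG Xiaoli  REN Huiling  LIU Hui  YIN Hao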
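Affiliations: Institute of Medical Information & Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005, China; Research Center for Network Big Data, Tsinghua University, Beijing 100084, China
Author profile: LYU Tingyu, master's student. Corresponding authors: LI Xiaoying, LIU Hui.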
Funding: National Social Science Fund of China (Grant No. 20BTQ062); Fundamental Research Funds for the Central Universities (Grant No. 3332023163).
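Abstract (translated from the Chinese): Purpose/Significance To construct a Chinese medical knowledge question-answering corpus dataset that provides a standardized evaluation benchmark for large language models in the medical vertical domain, and thereby to improve the accuracy and efficiency of large models on Chinese medical question-answering tasks. Method/Process Three datasets are constructed: a Q&A dataset on knowledge from Chinese medical papers, a Q&A dataset on medical term explanations, and a Q&A dataset based on past questions from the Chinese medical licensing examination. Relevant open-source datasets are also collated. Result/Conclusion The self-constructed Chinese medical knowledge Q&A corpus dataset enriches the sources of Chinese medical Q&A corpora and can serve as a standardized evaluation benchmark, promoting objective and comprehensive quantitative evaluation of large models in the medical field. Future work will draw on data such as electronic medical records and online health communities to provide more solid artificial intelligence support for implementing the Healthy China strategy.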

Keywords: large language models; corpus dataset; model evaluation; medicine
Revised: 2024-04-04

Study on the Construction of a Question-Answer Corpus Dataset for Chinese Medical Knowledge Large Language Models
LYU Tingyu, LI Xiaoying, ZHANG Ying, LIU Yuyang, DU Jinhua, LI Xinyi, LUO Yan, TANG Xiaoli, REN Huiling, LIU Hui, YIN Hao. Study on the Construction of a Question-Answer Corpus Dataset for Chinese Medical Knowledge Large Language Models[J]. Journal of Medical Informatics, 2024, 45(5): 20-25
Authors: LYU Tingyu  LI Xiaoying  ZHANG Ying  LIU Yuyang  DU Jinhua  LI Xinyi  LUO Yan  TANG Xiaoli  REN Huiling  LIU Hui  YIN Hao
Affiliation: Institute of Medical Information & Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005, China; Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
Abstract: Purpose/Significance To construct a Chinese medical knowledge Q&A corpus dataset as a standardized evaluation benchmark for large language models (LLMs) in the medical domain, so as to improve the accuracy and efficiency of LLMs in handling Chinese medical questions. Method/Process Q&A datasets are built from Chinese medical paper knowledge, from medical terminology explanations, and from past questions of the Chinese medical licensing examination, and relevant open-source Chinese medical Q&A datasets are collated alongside them. Result/Conclusion The Chinese medical knowledge Q&A corpus datasets enrich the sources of existing datasets and promote the objective and comprehensive quantitative evaluation of large models in the medical field. In the future, additional data such as electronic medical records and data from online health communities will be used to strengthen the support of artificial intelligence for the Healthy China strategy.
Keywords: large language models; corpus dataset; model evaluation; medicine