Study on the Construction of a Question-Answer Corpus Dataset for Chinese Medical Knowledge Large Language Models

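Cite this article:	LYU Tingyu, LI Xiaoying, ZHANG Ying, LIU Yuyang, DU Jinhua, LI Xinyi, LUO Yan, TANG Xiaoli, REN Huiling, LIU Hui, YIN Hao. Study on the Construction of a Question-Answer Corpus Dataset for Chinese Medical Knowledge Large Language Models[J]. Journal of Medical Informatics, 2024, 45(5): 20-25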

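Author names:	LYU Tingyu, LI Xiaoying, ZHANG Ying, LIU Yuyang, DU Jinhua, LI Xinyi, LUO Yan, TANG Xiaoli, REN Huiling, LIU Hui, YIN Hao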

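Author affiliations:	Institute of Medical Information/Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005; Research Center for Network Big Data, Tsinghua University, Beijing 100084. About the authors: LYU Tingyu, master's student; corresponding authors: LI Xiaoying, LIU Hui.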

Funding:	National Social Science Fund of China (Project No. 20BTQ062); Fundamental Research Funds for the Central Universities (Project No. 3332023163).

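Abstract:	Purpose/Significance: To construct a Chinese medical knowledge question-answering corpus dataset that provides a standardized evaluation benchmark for large language models in the medical vertical domain, and thereby to improve the accuracy and efficiency of large models on Chinese medical question-answering tasks. Method/Process: A question-answering dataset based on Chinese medical papers, a question-answering dataset of medical term explanations, and a question-answering dataset based on past questions from the Chinese medical licensing examination are constructed, and related open-source datasets are collated. Result/Conclusion: The independently constructed Chinese medical knowledge question-answering corpus dataset enriches the sources of Chinese medical question-answering corpora and can serve as a standardized evaluation benchmark, promoting objective and comprehensive quantitative evaluation of large models in the medical field. In future work, data such as electronic medical records and online health communities will be used to provide more solid artificial intelligence support for implementing the Healthy China strategy.
Keywords:	large language model; corpus dataset; model evaluation; medicine
Revision date:	2024-04-04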
Study on the Construction of a Question-Answer Corpus Dataset for Chinese Medical Knowledge Large Language Models

LYU Tingyu, LI Xiaoying, ZHANG Ying, LIU Yuyang, DU Jinhua, LI Xinyi, LUO Yan, TANG Xiaoli, REN Huiling, LIU Hui, YIN Hao. Study on the Construction of a Question-Answer Corpus Dataset for Chinese Medical Knowledge Large Language Models[J]. Journal of Medical Informatics, 2024, 45(5): 20-25

Authors:	LYU Tingyu, LI Xiaoying, ZHANG Ying, LIU Yuyang, DU Jinhua, LI Xinyi, LUO Yan, TANG Xiaoli, REN Huiling, LIU Hui, YIN Hao

Affiliation:	Institute of Medical Information & Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005, China; Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China

Abstract:	Purpose/Significance: To construct a Chinese medical knowledge Q&A corpus dataset as a standardized evaluation benchmark for large language models (LLMs) in the medical domain, so as to improve the accuracy and efficiency of LLMs in handling Chinese medical questions. Method/Process: Q&A datasets are constructed from Chinese medical papers, medical terminology explanations, and past questions of the Chinese medical licensing examination, and related open-source Chinese medical Q&A datasets are collated. Result/Conclusion: The constructed Chinese medical knowledge Q&A corpus dataset enriches the sources of Chinese medical Q&A corpora and can serve as a standardized evaluation benchmark, promoting objective and comprehensive quantitative evaluation of large models in the medical field. In future work, additional data such as electronic medical records and data from online health communities will be used to strengthen artificial intelligence support for the Healthy China strategy.

Keywords:	large language models; corpus dataset; model evaluation; medicine
