
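Research on Automatic Classification of Chinese Patents Based on Pre-trained Language Models
Cite this article: MA Jun, LV Lu-cheng, ZHAO Ya-juan, LI Cong-ying. Research on automatic classification of Chinese patents based on pre-trained language models[J]. 中华医学图书情报杂志 (Chinese Journal of Medical Library and Information Science), 2022, 31(11): 20-28
Authors: MA Jun  LV Lu-cheng  ZHAO Ya-juan  LI Cong-ying
Affiliations: Information Research Center of Military Sciences, Academy of Military Sciences, Beijing 100142; National Science Library, Chinese Academy of Sciences, Beijing 100190
Abstract:
Objective: To support accurate automatic classification of Chinese patents at large scale by using pre-trained language models with improved representations of Chinese patent text. Methods: Starting from the Chinese pre-trained language model RoBERTa, transfer learning was carried out on a large-scale corpus of Chinese invention patents using the masked language model task, once with a single-character masking strategy and once with a whole-word masking strategy. This yielded a RoBERTa model (ZL-RoBERTa) and a RoBERTa-wwm model (ZL-RoBERTa-wwm) with improved Chinese patent text representations. The models were then applied to a patent text classification task and compared with a typical deep learning model (Word2Vec+BiGRU+ATT+TextCNN) and with the current advanced pre-trained language models BERT and RoBERTa. Results: The Chinese patent classification models based on ZL-RoBERTa and ZL-RoBERTa-wwm achieved more prominent precision, recall, and F1 scores on the patent text classification task. Conclusion: Chinese patent pre-trained language models with improved text representations perform better on patent text classification, providing a model basis for applying pre-trained models in subsequent patent intelligence work.

Keywords: Chinese patent  Text representation  Pre-trained language model  Text classification
Received: 2022-10-09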

Research on automatic classification of Chinese patents based on pre-trained language models
MA Jun,LV Lu-cheng,ZHAO Ya-juan,LI Cong-ying. Research on automatic classification of Chinese patents based on pre-trained language models[J]. Chinese Journal of Medical Library and Information Science, 2022, 31(11): 20-28
Authors:MA Jun  LV Lu-cheng  ZHAO Ya-juan  LI Cong-ying
Affiliation:Information Research Center of Military Sciences, Academy of Military Sciences, Beijing 100142, China;National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Abstract:
Objective: To support accurate automatic classification of large-scale Chinese patent collections, this paper explores pre-trained language models with improved text representations of Chinese patents. Methods: Based on the Chinese RoBERTa model, a RoBERTa model (ZL-RoBERTa) and a RoBERTa-wwm model (ZL-RoBERTa-wwm) with improved Chinese patent text representations were obtained by transfer learning on a large-scale corpus of Chinese invention patents, using the masked language model task with a single-character masking strategy and a whole-word masking strategy, respectively. The models were applied to a patent text classification task and compared with a typical deep learning model (Word2Vec+BiGRU+ATT+TextCNN) and with the current state-of-the-art pre-trained language models BERT and RoBERTa. Results: The Chinese patent classification models based on ZL-RoBERTa and ZL-RoBERTa-wwm achieved better precision, recall, and F1 scores on the patent text classification task than the compared models. Conclusion: Chinese patent pre-trained language models with improved text representations are more effective for patent text classification, which provides a model basis for the subsequent application of pre-trained language models in patent intelligence work.
Keywords:Chinese patent   Text representation   Pre-trained language model   Text classification
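
The two masking strategies named in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the sentence has already been segmented into words (in practice a Chinese word segmenter would supply this), and it simply replaces selected positions with `[MASK]`, omitting the random-replacement and keep-unchanged cases that BERT-style pre-training usually adds. The function names, the 15% ratio, and the example sentence are all hypothetical.

```python
import random

MASK = "[MASK]"


def char_mask(words, ratio, rng):
    """Single-character masking: pick characters independently of word boundaries."""
    chars = [c for w in words for c in w]
    n = max(1, round(len(chars) * ratio))
    picked = set(rng.sample(range(len(chars)), n))
    return [MASK if i in picked else c for i, c in enumerate(chars)]


def whole_word_mask(words, ratio, rng):
    """Whole-word masking: when a word is picked, mask all of its characters."""
    total = sum(len(w) for w in words)
    budget = max(1, round(total * ratio))  # target number of masked characters
    order = list(range(len(words)))
    rng.shuffle(order)
    picked, used = set(), 0
    for i in order:
        if used >= budget:
            break
        picked.add(i)
        used += len(words[i])
    out = []
    for i, w in enumerate(words):
        out.extend([MASK] * len(w) if i in picked else list(w))
    return out


if __name__ == "__main__":
    # Hypothetical pre-segmented patent-style phrase (assumed segmentation).
    words = ["一种", "锂离子", "电池", "隔膜", "的", "制备", "方法"]
    print(char_mask(words, 0.15, random.Random(0)))
    print(whole_word_mask(words, 0.15, random.Random(0)))
```

With single-character masking the model can often recover a masked character from the rest of the same word, whereas whole-word masking hides the entire word and forces the prediction to rely on surrounding context. This difference is the usual motivation for comparing the two strategies on domain text such as patents.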