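Identification of O6-carboxymethyl guanine in nanopore sequencing based on unsupervised deep learning
Cite this article: GUAN Xiaoyu, WANG Yu, ZHANG Jinyue, SHAO Wei, HUANG Shuo, ZHANG Daoqiang. Identification of O6-carboxymethyl guanine in nanopore sequencing based on unsupervised deep learning[J]. Journal of Biomedical Engineering, 2022, 0(1)
Authors: GUAN Xiaoyu  WANG Yu  ZHANG Jinyue  SHAO Wei  HUANG Shuo  ZHANG Daoqiang
Affiliations: MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; State Key Laboratory of Analytical Chemistry for Life Sciences, School of Chemistry and Chemical Engineering, Nanjing University; Chemistry and Biomedicine Innovation Center, Nanjing University
Funding: National Natural Science Foundation of China (61861130366, 61876082, 61732006, 62136004); National Key Research and Development Program of China (2018YFC2001600, 2018YFC2001602).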
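Abstract: O6-carboxymethyl guanine (O6-CMG) is a highly mutagenic alkylation product in DNA that causes gastrointestinal tumors in living organisms. Existing studies mainly use Mycobacterium smegmatis porin A (MspA) nanopore technology, assisted by the Bacillus subtilis phage Phi29 DNA polymerase, to localize the mutation precisely. In recent years, machine learning has been widely applied to the analysis of nanopore sequencing data, but it often requires large amounts of labeled data, which adds to researchers' workload and greatly reduces its practicality. This paper therefore proposes a nano unsupervised deep learning (nano-UDL) method that automatically identifies nanopore data containing mutated segments. nano-UDL uses a deep autoencoder to extract features from nanopore data and then classifies the feature data with the MeanShift clustering algorithm. In addition, the method jointly optimizes the clustering loss and the reconstruction loss so as to extract the optimal features for clustering. Experimental results show that nano-UDL achieves high recognition accuracy on the O6-CMG dataset and accurately identifies all sequence segments containing O6-CMG. To further verify the robustness of nano-UDL, hyperparameter sensitivity tests and ablation experiments were carried out. Analyzing nanopore data with nano-UDL not only reduces the extra cost of manual data analysis but is also significant for many biological studies, including genome sequencing.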

Keywords: Carboxymethyl guanine  Nanopore sequencing  DNA damage  Gastrointestinal tumors  Deep learning  Unsupervised learning

Unsupervised deep learning for identifying the O6-carboxymethyl guanine by nanopore sequencing
GUAN Xiaoyu, WANG Yu, ZHANG Jinyue, SHAO Wei, HUANG Shuo, ZHANG Daoqiang. Unsupervised deep learning for identifying the O6-carboxymethyl guanine by nanopore sequencing[J]. Journal of Biomedical Engineering, 2022, 0(1)
Authors: GUAN Xiaoyu  WANG Yu  ZHANG Jinyue  SHAO Wei  HUANG Shuo  ZHANG Daoqiang
Affiliation: (College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing 211106, P.R. China; State Key Laboratory of Analytical Chemistry for Life Sciences, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing 210023, P.R. China; Chemistry and Biomedicine Innovation Center, Nanjing University, Nanjing 210023, P.R. China)
Abstract: O6-carboxymethyl guanine (O6-CMG) is a highly mutagenic alkylation product of DNA that causes gastrointestinal cancer in living organisms. Existing studies have used a mutant Mycobacterium smegmatis porin A (MspA) nanopore, assisted by Phi29 DNA polymerase, to localize it. Recently, machine learning has been widely applied to the analysis of nanopore sequencing data. However, machine learning methods often require large amounts of labeled data, which places an extra burden on researchers and greatly limits their practicality. Accordingly, this paper proposes a nano Unsupervised Deep Learning method (nano-UDL), based on an unsupervised clustering algorithm, to identify methylation events in nanopore data automatically. Specifically, nano-UDL first uses a deep AutoEncoder to extract features from the nanopore dataset and then applies the MeanShift clustering algorithm to classify the data. In addition, nano-UDL extracts the optimal features for clustering by jointly optimizing the clustering loss and the reconstruction loss. Experimental results demonstrate that nano-UDL achieves relatively high recognition accuracy on the O6-CMG dataset and accurately identifies all sequence segments containing O6-CMG. To further verify the robustness of nano-UDL, hyperparameter sensitivity tests and ablation experiments were carried out. Using machine learning to analyze nanopore data can effectively reduce the additional cost of manual data analysis, which is significant for many biological studies, including genome sequencing.
Keywords:Carboxymethyl guanine  Nanopore sequencing  DNA lesion  Gastrointestinal cancer  Deep Learning  Unsupervised learning
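
The clustering step named in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the function names `mean_shift` and `make_demo_features`, the flat (uniform) kernel, the bandwidth value, and the synthetic 2-D points that stand in for features a deep autoencoder might produce. The autoencoder itself and the joint optimization of clustering and reconstruction losses are not shown.

```python
import math
import random


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift (illustrative sketch, not the paper's code).

    Returns (labels, modes) for a list of equal-length feature tuples.
    """
    def shift(x):
        # Move x to the mean of all points lying within `bandwidth` of it.
        near = [q for q in points if math.dist(x, q) <= bandwidth]
        if not near:  # guard against floating-point edge cases
            return x
        return tuple(sum(col) / len(near) for col in zip(*near))

    modes = []   # one representative location per discovered cluster
    labels = []  # cluster index for each input point
    for p in points:
        x = tuple(p)
        for _ in range(max_iter):
            nxt = shift(x)
            moved = math.dist(x, nxt)
            x = nxt
            if moved < tol:
                break
        # Points whose trajectories end near an existing mode share its label.
        for k, m in enumerate(modes):
            if math.dist(x, m) <= bandwidth / 2:
                labels.append(k)
                break
        else:
            modes.append(x)
            labels.append(len(modes) - 1)
    return labels, modes


def make_demo_features(seed=0):
    """Synthetic stand-in for learned features: two separated 2-D blobs."""
    rng = random.Random(seed)
    blob_a = [(rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(30)]
    blob_b = [(rng.gauss(5.0, 0.3), rng.gauss(5.0, 0.3)) for _ in range(30)]
    return blob_a + blob_b


features = make_demo_features()
labels, modes = mean_shift(features, bandwidth=2.0)
print("clusters found:", len(modes))
```

Mean shift does not need the number of clusters in advance, which suits an unsupervised setting; the bandwidth is the main hyperparameter, and the value used above is chosen only to fit the synthetic data.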
This article is indexed in VIP (维普) and other databases.