首页 | 本学科首页   官方微博 | 高级检索  
检索        

归并奇异值分解:一种快速更新隐含语义索引的方法
引用本文:黄明,林家骏.归并奇异值分解:一种快速更新隐含语义索引的方法[J].医学教育探索,2018,44(3):397-403.
作者姓名:黄明  林家骏
作者单位:华东理工大学信息科学与工程学院, 上海 200237,华东理工大学信息科学与工程学院, 上海 200237
摘    要:隐含语义索引(LSI)是一种解决信息检索中二义性问题和大规模文档分类的文档索引方法。为了提高LSI效率,应对大数据场景下文档量爆发式增长的问题,提出了一种通过归并奇异值分解来实现LSI快速更新的方法。该方法利用p-边宽单边对角矩阵和箭头矩阵分解技术来加快中间矩阵的奇异值分解过程,并通过将新增文档矩阵的薄奇异值分解(PSVD)归并进主文档矩阵的PSVD以避免重复计算,加快LSI更新速度。通过数学证明论证了该方法的有效性,并讨论了该算法扩展到词条更新场景中的情形。在多个测试数据集上的实验验证了该方法可以在保证检索准确率的前提下有效提高LSI的更新效率。

关 键 词:信息检索  隐含语义索引  奇异值分解  文档聚类  QR分解  箭头型矩阵
收稿时间:2017/7/24 0:00:00

Merging PSVDs: An Approach to Fast Update Latent Semantic Indexing
HUANG Ming and LIN Jia-jun.Merging PSVDs: An Approach to Fast Update Latent Semantic Indexing[J].Researches in Medical Education,2018,44(3):397-403.
Authors:HUANG Ming and LIN Jia-jun
Institution:School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China and School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
Abstract:As well known, the singular value decomposition (SVD) is a useful tool for analyzing and dealing with high dimensional vectors, or even matrices in the field of information retrieval and image processing, due to its strong ability in reducing the dimensions of the question space. Latent semantic indexing (LSI) is a popular application of the SVD algorithm in information retrieval, which can overcome the two fundamental problems plaguing traditional lexical-matching indexing schemes, synonymy and polysemy. In order to deal with massive and rapidly growing data, some algorithms via incrementally updating LSI on new documents have been recently developed. In this paper, we propose an improved LSI algorithm for achieving faster computation speed without retrieval accuracy degradation, compared with universal LSI updating algorithms, e.g., Zha''s and derivatives. There are two main time-consuming steps in this algorithm, the one of which is to compute the QR decomposition of incremental left vector matrix, and the other is to compute the SVD of intermediate matrix. It is noted that there are iterations in both of the two steps hard to parallelize. The proposed approach in this work updates LSI by merging the partial singular value decomposition (PSVD) of incremental documents into the main PSVD. Compared with Zha''s method, the proposed algorithm performs the QR decomposition on a thinner matrix and the intermediate SVD on a s-bordered diagonal matrix. Moreover, the proposed algorithm can be parallelized on computing PSVDs of multiple new document sets and merging them into the main PSVD. The theoretical justification for using the PSVD of incremental documents in the updating process is also attached via supplement. Finally, the experiments via comparing the existing methods and the proposed LSI updating method are provided for illustrating the acceleration and accuracy. Furthermore, the proposed algorithm could even be used as a universal algorithm for merging PSVDs.
Keywords:information retrieval  latent semantic indexing  singular value decomposition  document clustering  QR decomposition  arrowhead matrix
点击此处可从《医学教育探索》浏览原始摘要信息
点击此处可从《医学教育探索》下载免费的PDF全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号