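A Study of Partial-Duplicate Document Detection
Published: 2018-04-16 18:38
Topics: partial-duplicate document detection + Low-IDF-SIG feature extraction algorithm; Source: Fudan University, master's thesis, 2012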
[Abstract]: With the explosive growth of data on the Internet, a large amount of duplicate content has appeared online. This duplicate content causes serious problems for many Web applications, such as search engines and opinion mining. Most existing duplicate-detection algorithms work at the document level and cannot effectively detect cases in which only part of one document duplicates part of another. This thesis proposes an algorithm for partial-duplicate document detection. The method is divided into two sub-problems: sentence-level duplicate detection and sequence matching. First, the thesis proposes a fast and effective sentence-level feature extraction method, the Low-IDF-SIG algorithm, and uses it to build a detection system that efficiently finds sentence-level duplicates. To evaluate the accuracy and efficiency of the proposed method, the author compares it with other methods on a real corpus. The experimental results show that the proposed method effectively improves both the efficiency and the accuracy of sentence-level duplicate detection. In addition, the thesis proposes PDC-MR-Ⅱ, a partial-duplicate document detection algorithm based on the MapReduce paradigm, and uses it to build an efficient distributed partial-duplicate detection system. The algorithms and systems presented in the thesis can be widely applied to tasks such as plagiarism detection in papers, duplicate-topic detection in forums, and duplicate detection in paginated news articles.
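The two sub-problems named in the abstract, sentence-level duplicate detection and sequence matching, can be illustrated with a small single-machine sketch. The abstract does not describe how Low-IDF-SIG or PDC-MR-Ⅱ actually work, so everything below is an assumption made for illustration: each sentence's signature is taken to be its `k` lowest-IDF terms, and sequence matching is done with a plain longest-common-run search over the two signature sequences. All function names are hypothetical, and the MapReduce distribution is not reproduced.

```python
import math
import re
from collections import Counter


def split_sentences(doc):
    """Naive sentence splitter on ., ! and ? (an assumption, not from the thesis)."""
    return [s.strip() for s in re.split(r"[.!?]+", doc) if s.strip()]


def tokenize(sentence):
    return re.findall(r"\w+", sentence.lower())


def compute_idf(sentences):
    """IDF of each term, treating every sentence as one 'document'."""
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(tokenize(s)))
    return {term: math.log(n / count) for term, count in df.items()}


def signature(sentence, idf, k=3):
    """Assumed signature: the k lowest-IDF terms, ties broken alphabetically."""
    terms = sorted(set(tokenize(sentence)),
                   key=lambda t: (idf.get(t, float("inf")), t))
    return tuple(sorted(terms[:k]))


def longest_shared_run(sigs_a, sigs_b):
    """Longest run of consecutive sentences whose signatures match.

    Returns (length, start index in a, start index in b).
    """
    best = (0, 0, 0)
    prev = [0] * (len(sigs_b) + 1)
    for i, sa in enumerate(sigs_a, 1):
        cur = [0] * (len(sigs_b) + 1)
        for j, sb in enumerate(sigs_b, 1):
            if sa == sb:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def find_partial_duplicate(doc_a, doc_b, k=3):
    sents_a, sents_b = split_sentences(doc_a), split_sentences(doc_b)
    idf = compute_idf(sents_a + sents_b)
    sigs_a = [signature(s, idf, k) for s in sents_a]
    sigs_b = [signature(s, idf, k) for s in sents_b]
    return longest_shared_run(sigs_a, sigs_b)
```

Calling `find_partial_duplicate(doc_a, doc_b)` on two documents that share a block of sentences returns the length of the longest shared run and where it starts in each document. Signature equality is only a candidate filter in this sketch; a real system would presumably verify candidate sentence pairs before reporting them.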
Article ID: 1760120
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1760120.html