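Research and Application of Text Similarity Detection Based on Fingerprint Retrieval
Published: 2018-02-13 01:18
Keywords: text similarity detection; fingerprint retrieval; b-bit minwise hashing; fine-grained extraction; clustering. Source: Central South University (中南大学), master's thesis, 2013. Document type: degree thesis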
[Abstract]: The openness of the Internet and the ease with which text can be copied make it convenient to share academic resources, but they also create opportunities for academic misconduct such as copying and plagiarism. From the standpoint of protecting intellectual property and upholding academic integrity, research on text similarity detection has become a necessary direction.
With similarity detection for the application documents of a certain fund project as the application background, and with the aim of detecting similar documents quickly and accurately in massive document collections, this thesis studies the key technologies of a similarity detection system based on fingerprint retrieval, such as fast fingerprint retrieval algorithms and techniques, and fingerprint extraction models and methods. The specific work is as follows:
(1) To address two problems in massive-text similarity retrieval, namely that a small number of fingerprints makes the similarity estimate inaccurate and that computing distances between high-dimensional vectors is time-consuming, a parallel retrieval algorithm based on fingerprint grouping is proposed. Fingerprints are grouped and indexed, and the low-order fingerprints are retrieved first (pre-retrieval), which reduces the number of document distance computations. In addition, CPU+GPU parallelism is used during fingerprint retrieval to shorten the overall retrieval time and to improve retrieval accuracy at low similarity thresholds.
(2) In view of the structured content of the documents, the diversity of their chapters, and the marked differences in how much attention users pay to different parts of a document, the thesis studies fine-grained partitioning methods, fuzzy matching of marker words, and Chinese word segmentation, in order to extract text accurately at coarse and fine granularities such as chapters, paragraphs, and sentences. To meet the accuracy requirements of fund project detection, a string-matching approach that combines the forward maximum matching and reverse maximum matching algorithms is used to ensure accurate extraction of feature fingerprints. The resulting fingerprints safeguard the quality of subsequent detection and allow similarity evidence to be presented intuitively and clearly.
(3) The thesis describes the functional framework and main workflow of the text similarity checking system, analyzes in detail techniques such as document clustering, similarity estimation, and detailed document comparison and result presentation, and studies their implementation in combination with the proposed fingerprint-grouping parallel retrieval algorithm and fine-grained text extraction technique. The thesis contains 20 figures, 4 tables, and 56 references.
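The combination of forward and reverse maximum matching mentioned in item (2) can be illustrated with a minimal Python sketch. The abstract does not say how the two segmentations are reconciled, so the tie-breaking rule used here (fewer tokens, then fewer single-character tokens, otherwise prefer the reverse result) is an assumption based on a common heuristic, not the thesis's actual method. The function names and the tiny `SAMPLE_VOCAB` dictionary are likewise hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical toy dictionary; a real system would load a full lexicon.
SAMPLE_VOCAB: Set[str] = {"研究", "研究生", "生命", "起源"}


def forward_max_match(text: str, vocab: Set[str], max_len: int) -> List[str]:
    """Scan left to right, always taking the longest dictionary word."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest window first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text: str, vocab: Set[str], max_len: int) -> List[str]:
    """Scan right to left, always taking the longest dictionary word."""
    tokens: List[str] = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


def bidirectional_match(text: str, vocab: Set[str]) -> List[str]:
    """Run both scans and keep the segmentation with the lower cost.

    Assumed heuristic: fewer tokens wins, then fewer single-character
    tokens; on a full tie the reverse result is kept.
    """
    max_len = max((len(word) for word in vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    rev = reverse_max_match(text, vocab, max_len)
    if fwd == rev:
        return fwd

    def cost(tokens: List[str]) -> Tuple[int, int]:
        return (len(tokens), sum(1 for t in tokens if len(t) == 1))

    return fwd if cost(fwd) < cost(rev) else rev


if __name__ == "__main__":
    # Forward scan gives 研究生/命/的/起源; reverse gives 研究/生命/的/起源.
    print(bidirectional_match("研究生命的起源", SAMPLE_VOCAB))
    # prints ['研究', '生命', '的', '起源']
```

In this example the two scans disagree on where "研究生命" splits. Both yield four tokens, but the reverse scan leaves only one single-character token, so the heuristic selects it.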
[Degree-granting institution]: Central South University (中南大学)
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.1
Article ID: 1506995
Article link: https://www.wllwen.com/falvlunwen/zhishichanquanfa/1506995.html