Research on Document Copy Detection Methods and System Implementation

Published: 2018-02-14 18:04

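Keywords: text copy detection; online copy detection; keyword extraction; similarity computation; inverted index. Source: Harbin Institute of Technology, 2012 master's thesis. Document type: degree thesis.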

[Abstract]: With the rapid growth of the Internet, online information resources have become increasingly abundant and exchanging information has become ever more convenient. However, because electronic resources such as text, images, and video are so easy to copy, the web contains a great deal of redundant content, which lowers the retrieval efficiency of search engines and makes information extraction harder. In recent years, plagiarism of homework and of theses has also occurred frequently at some universities. To improve the efficiency of web information retrieval, protect intellectual property, and uphold academic integrity, document copy detection has become an active research topic in natural language processing, and the work is of considerable significance. This thesis studies document copy detection in detail and, building on prior work, improves the copy detection method based on sentence similarity computation, substantially improving both detection efficiency and detection accuracy. First, the thesis gives a detailed introduction to the background, significance, state of research in China and abroad, and related techniques of document copy detection, and analyzes the strengths and weaknesses of commonly used text copy detection algorithms. Second, starting from the traditional BSP copy detection algorithm, it proposes a sentence similarity algorithm based on the ordered longest common keyword sequence and a local sentence copy detection algorithm based on keyword distance; it also designs word-to-sentence and sentence-to-document inverted index structures, which effectively improve detection accuracy and efficiency. Third, based on the proposed method, a text copy detection system is designed and implemented. Following practical application requirements, its main functions include document registration, document retrieval, synonym maintenance, local copy detection, distributed copy detection, online copy detection, network settings, system settings, and document library management. Finally, experimental results demonstrate the practicality and effectiveness of the document copy detection method studied in this thesis.
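The abstract names two of its building blocks, a sentence similarity based on the ordered longest common keyword sequence and a pair of inverted indexes (word to sentence, sentence to document), without giving formulas. The following is only a minimal sketch of how such pieces could fit together, not the thesis's actual implementation. The class name `CopyIndex`, the function names, the normalization by the longer keyword sequence, and the `threshold` value are all assumptions made for illustration; keyword extraction is assumed to have been done already, so each sentence is passed in as a list of keywords.

```python
from collections import defaultdict


def lcs_length(a, b):
    """Length of the longest common subsequence of two keyword lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sentence_similarity(a, b):
    """Order-aware similarity in [0, 1].

    Assumption: the LCS length is normalized by the longer keyword
    sequence; the thesis may use a different normalization.
    """
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


class CopyIndex:
    """Hypothetical two-level inverted index: word -> sentences -> document."""

    def __init__(self):
        self.word_to_sentences = defaultdict(set)  # keyword -> sentence ids
        self.sentence_to_doc = {}                  # sentence id -> document id
        self.sentences = []                        # sentence id -> keyword list

    def register(self, doc_id, sentences):
        """Add a document given as a list of keyword lists (one per sentence)."""
        for keywords in sentences:
            sid = len(self.sentences)
            self.sentences.append(keywords)
            self.sentence_to_doc[sid] = doc_id
            for word in keywords:
                self.word_to_sentences[word].add(sid)

    def find_copies(self, keywords, threshold=0.6):
        """Return (doc_id, sentence_id, score) for sentences similar to the query."""
        # Only sentences sharing at least one keyword are scored at all.
        candidates = set()
        for word in keywords:
            candidates |= self.word_to_sentences.get(word, set())
        hits = []
        for sid in sorted(candidates):
            score = sentence_similarity(keywords, self.sentences[sid])
            if score >= threshold:
                hits.append((self.sentence_to_doc[sid], sid, score))
        return hits
```

In this sketch the word-to-sentence index narrows the search to sentences that share a keyword with the query, so the quadratic-time LCS comparison runs only on those candidates, and the sentence-to-document map then attributes each matching sentence to its source document.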
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.1

[References]