Research on Document Copy Detection Methods and System Implementation
Published: 2018-02-14 18:04
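Keywords: text copy detection; online copy detection; keyword extraction; similarity computation; inverted index. Source: Harbin Institute of Technology, 2012 master's thesis. Type: degree thesis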
[Abstract]: With the rapid development of the Internet, online information resources have grown increasingly rich and exchanging information has become ever more convenient. However, because electronic resources such as text, images, and video are easy to copy, online resources have become highly redundant, which lowers the retrieval efficiency of web search engines and makes information extraction harder. In recent years, plagiarism of homework and papers has also become frequent at some universities. To improve the efficiency of web information retrieval, protect intellectual property, and uphold academic integrity, document copy detection has become a research hotspot in natural language processing.

This thesis studies document copy detection in detail. Building on previous work, it improves the document copy detection method based on sentence similarity computation, which the author reports substantially improves both detection efficiency and detection accuracy.

First, the thesis introduces the background, significance, state of research in China and abroad, and related techniques of document copy detection, and analyzes the strengths and weaknesses of commonly used text copy detection algorithms.

Second, based on the traditional BSP copy detection algorithm, it proposes a sentence similarity algorithm based on the ordered longest common keyword sequence and a sentence-level local copy detection algorithm based on keyword distance. It also designs word-to-sentence and sentence-to-document inverted index structures, which improve detection accuracy and efficiency.

Third, based on the proposed text copy detection method, a text copy detection system is designed and implemented. According to practical application requirements, its main functions include document registration, document retrieval, synonym maintenance, local copy detection, distributed copy detection, online copy detection, network settings, system settings, and document library management.

Finally, the experimental results demonstrate the practicality and effectiveness of the document copy detection method studied in this thesis.
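To make the two core ideas in the abstract concrete, the following is a minimal sketch, not the thesis's actual implementation. It assumes each sentence has already been reduced to an ordered list of keywords (the segmentation and keyword extraction steps are not shown). The similarity normalization (longest common keyword subsequence length divided by the longer sentence's keyword count) and all function names are illustrative assumptions; the abstract does not give the exact formula. The keyword-distance local copy detection step is not covered.

```python
from collections import defaultdict


def lcs_length(a, b):
    """Length of the longest common subsequence of two keyword lists (order-preserving)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sentence_similarity(kw_a, kw_b):
    """Assumed normalization: LCS length over the longer keyword list."""
    if not kw_a or not kw_b:
        return 0.0
    return lcs_length(kw_a, kw_b) / max(len(kw_a), len(kw_b))


def build_indexes(docs):
    """Build two inverted indexes from {doc_id: [keyword list per sentence]}.

    Returns (word -> set of sentence ids, sentence id -> doc id).
    Sentence ids are running integers assigned in iteration order.
    """
    word_to_sentences = defaultdict(set)
    sentence_to_doc = {}
    next_id = 0
    for doc_id, sentences in docs.items():
        for keywords in sentences:
            sentence_to_doc[next_id] = doc_id
            for word in keywords:
                word_to_sentences[word].add(next_id)
            next_id += 1
    return word_to_sentences, sentence_to_doc


def candidate_sentences(query_keywords, word_to_sentences):
    """Sentence ids sharing at least one keyword with the query sentence."""
    candidates = set()
    for word in query_keywords:
        candidates |= word_to_sentences.get(word, set())
    return candidates


if __name__ == "__main__":
    docs = {
        "d1": [["copy", "detection", "method"], ["inverted", "index"]],
        "d2": [["copy", "index"]],
    }
    words, owners = build_indexes(docs)
    query = ["copy", "detection", "system"]
    for sid in sorted(candidate_sentences(query, words)):
        print(sid, owners[sid])
```

The indexes narrow the search: only sentences that share a keyword with the query are scored with `sentence_similarity`, and the sentence-to-document map attributes each match back to its source document.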
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1
Article ID: 1511286
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1511286.html