Design and Implementation of a Document Copy Detection System Based on Similarity Estimation
Published: 2019-03-16 15:31
[Abstract]: With the development of computer network applications, the volume of similar information on the Internet has grown geometrically. The growing number of highly similar documents consumes large amounts of network storage and also degrades the user experience. The openness of information platforms and the easy availability of digital text have made academic misconduct, such as copying and even outright plagiarism of papers, increasingly common, and the serious consequences are self-evident. To improve the efficiency of information retrieval and to protect intellectual property, designing and implementing a document copy detection system based on similarity estimation has significant technical and practical value. To detect similar documents quickly and accurately in massive data sets, this thesis studies the theory and methods of document similarity estimation in depth, and designs and implements a document copy detection system based on similarity estimation. The main work is as follows. Based on the minwise similarity estimator, the thesis designs and implements a document similarity detection system comprising three functional subsystems: document preprocessing, similarity computation, and presentation and export of similarity results. It focuses on project document clustering, the similarity estimation algorithm, coloring of similarity evidence, generation of similarity reports, and statistical analysis of the data. Following the waterfall model from software engineering, the thesis describes in detail the business requirements, system architecture design, functional design, and main business process design of the system, and presents the implementation environment, the interface design, and the implementation of the key functional modules. After development and testing, the resulting system offers more user-friendly operation, and both the text extraction rate for various document formats (PDF, Word) and the computational efficiency of similarity comparison are significantly improved.
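The abstract names the minwise similarity estimator as the basis of the system but gives no implementation details. As a rough illustration only, the following Python sketch shows the general minwise (MinHash) idea: each document is reduced to a set of word shingles, a fixed number of seeded hash functions each record the minimum hash value over that set, and the fraction of positions where two signatures agree estimates the Jaccard similarity of the sets. The function names, the 3-word shingle size, the 128-hash signature length, and the use of BLAKE2b are all assumptions made for this sketch and are not taken from the thesis.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (hypothetical tokenization)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, item):
    """64-bit hash of an item under a given seed (stands in for a random permutation)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Signature = minimum hash value of the set under each seeded hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching signature positions; estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, used here only to check the estimate."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


if __name__ == "__main__":
    doc_a = "the system extracts text from pdf and word files and compares them"
    doc_b = "the system extracts text from pdf and word files and clusters them"
    set_a, set_b = shingles(doc_a), shingles(doc_b)
    est = estimate_similarity(minhash_signature(set_a), minhash_signature(set_b))
    print(f"exact Jaccard: {jaccard(set_a, set_b):.3f}, minwise estimate: {est:.3f}")
```

Because signatures have a fixed length regardless of document size, comparing two documents costs the same however long they are, which is the usual reason for using this kind of estimator on large collections. Chinese text would need a different tokenization step (for example character n-grams) than the word-based one assumed here.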
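[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.1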
Article ID: 2441642
Article link: https://www.wllwen.com/falvlunwen/zhishichanquanfa/2441642.html