Research and Implementation of a Plagiarized Paper Identification System Based on Discourse Structure

Published: 2018-08-08 13:49

[Abstract]: Plagiarism is an increasingly serious problem. With the spread and rapid development of digital libraries and the Internet, the large volume of resources available in digital form has made plagiarism easier, especially for students and academic researchers, who can readily find content related to their research topics through web search tools. In recent years in particular, incidents such as copying and duplicate submission of a single manuscript have repeatedly appeared in the press, and the seriousness of the problem is attracting growing attention. To eliminate such practices and clean up the academic environment, it is urgent to build an effective plagiarism identification system, in addition to strengthening the education of students and enacting relevant laws and regulations.
Existing plagiarized-paper detection techniques and systems have several shortcomings. First, existing prototype systems do not analyze the now common pattern in which one paper plagiarizes from several papers. Second, most existing prototype systems do not include discourse-structure similarity in the detection process; those that do consider discourse-structure features cover them incompletely or treat them in unreasonable ways. Third, for documents in which plagiarism has already occurred, existing prototype systems do not determine the type of plagiarism, and they cannot capture very obvious types of plagiarism quickly and accurately. This thesis therefore studies existing copy detection techniques and analyzes the characteristics of plagiarized papers. It then uses a digital fingerprinting method similar to COPS to identify complete and partial plagiarism in academic papers, and a word-frequency statistics method based on discourse information to identify implicit plagiarism. The methods before and after improvement are compared experimentally using the P-R and MAP metrics.
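The fingerprinting step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: it assumes sentence-level chunks hashed with MD5, in the spirit of COPS, and the function names, the containment score, and both thresholds are hypothetical choices made only for this example.

```python
import hashlib
import re


def sentence_fingerprints(text):
    """Hash each normalized sentence; the set of hashes acts as the document fingerprint."""
    # Assumption: sentences end at common Chinese or English terminators.
    sentences = re.split(r"[。！？.!?]+", text)
    prints = set()
    for sentence in sentences:
        # Lowercase and drop whitespace so trivial spacing or case edits keep the same hash.
        normalized = re.sub(r"\s+", "", sentence.lower())
        if normalized:
            prints.add(hashlib.md5(normalized.encode("utf-8")).hexdigest())
    return prints


def containment(suspect, source):
    """Fraction of the suspect's sentence hashes that also occur in the source."""
    suspect_prints = sentence_fingerprints(suspect)
    if not suspect_prints:
        return 0.0
    shared = suspect_prints & sentence_fingerprints(source)
    return len(shared) / len(suspect_prints)


def classify(suspect, source, full_threshold=0.9, partial_threshold=0.2):
    """Map the containment score to a coarse label; both thresholds are illustrative."""
    score = containment(suspect, source)
    if score >= full_threshold:
        return "complete"
    if score >= partial_threshold:
        return "partial"
    return "none"
```

Calling `classify(suspect_text, source_text)` returns `"complete"`, `"partial"`, or `"none"`. Exact sentence hashing only catches verbatim reuse, which is why the abstract handles implicit plagiarism with a separate word-frequency method.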
[Degree-granting institution]: Northeast Normal University
[Degree level]: Master's
[Year degree granted]: 2009
[Classification number]: TP311.52
