基于局部敏感性哈希的代码相似性检测技术研究
发布时间:2018-12-21 15:37
【摘要】:随着社会的进步,软件系统被广泛运用在日常生活的各个方面,软件系统的代码量不断提升。各类软件系统都有可能面临系统重构和知识产权保护的需求,这就需要对于软件系统源代码进行相似性检测。但是,近些年提出的检测技术对检测效率的提升还有进一步提高的空间。本文通过将局部敏感性哈希应用在高维源代码关键信息矩阵中的近邻搜索,从而快速的得到相似对查询结果。源代码会首先被进行预处理,将源代码转换成q-gram的标识序列。代码段的q-gram标识序列集合通过基于Jplag的相似代码块检测技术进行分类并利用局部敏感性哈希算法将分类结果放置到不同的桶中,在桶中的数据会被重新组织为前缀树形式的数据结构。为了定位不同代码集合中的相似代码对,需要对查询代码段进行相同的哈希运算,可以得到相似概率大于给定阈值的代码块前缀树,利用前缀树的搜索算法对相似代码对进行准确的定位。由于在使用局部敏感性哈希进行分类的过程中就已经将相似的代码块分到同一个桶中,所以可以有效的降低无效查找的时间成本,提高处理效率。本文对基于局部敏感性哈希的代码相似性检测方法进行了研究与分析,提出了使用局部敏感性哈希进行相似代码对快速检索的方法。实验证明了相比原有方法本文提出的方法处理效率提高了约10%。
[Abstract]:With the development of society, software system is widely used in every aspect of daily life. All kinds of software systems are likely to face the need of system reconstruction and intellectual property protection, which requires the similarity detection of the source code of the software system. However, the detection technology proposed in recent years has further improved the detection efficiency. In this paper, the local sensitive hash is applied to the nearest neighbor search in the key information matrix of high dimensional source code, and the similar pair query results are obtained quickly. The source code is preprocessed to convert the source code into an identity sequence of q-gram. The q-gram identification sequence set of the code segment is classified by the similar code block detection technique based on Jplag, and the classification results are placed in different buckets by using the local sensitive hash algorithm. The data in the bucket is reorganized into a data structure in the form of a prefix tree. In order to locate the similar code pairs in different code sets, we need the same hash operation on the query code segment, and we can get the code block prefix tree whose similarity probability is greater than a given threshold. The prefix tree search algorithm is used to locate the similar code pairs accurately. Because similar code blocks have been divided into the same bucket in the process of using local sensitive hash, it can effectively reduce the time cost of invalid lookup and improve the processing efficiency. In this paper, the method of code similarity detection based on local sensitive hashing is studied and analyzed. The experimental results show that the efficiency of the proposed method is about 10% higher than that of the original method.
【学位授予单位】:天津工业大学
【学位级别】:硕士
【学位授予年份】:2017
【分类号】:TP311.5
本文编号:2389105
[Abstract]:With the development of society, software system is widely used in every aspect of daily life. All kinds of software systems are likely to face the need of system reconstruction and intellectual property protection, which requires the similarity detection of the source code of the software system. However, the detection technology proposed in recent years has further improved the detection efficiency. In this paper, the local sensitive hash is applied to the nearest neighbor search in the key information matrix of high dimensional source code, and the similar pair query results are obtained quickly. The source code is preprocessed to convert the source code into an identity sequence of q-gram. The q-gram identification sequence set of the code segment is classified by the similar code block detection technique based on Jplag, and the classification results are placed in different buckets by using the local sensitive hash algorithm. The data in the bucket is reorganized into a data structure in the form of a prefix tree. In order to locate the similar code pairs in different code sets, we need the same hash operation on the query code segment, and we can get the code block prefix tree whose similarity probability is greater than a given threshold. The prefix tree search algorithm is used to locate the similar code pairs accurately. Because similar code blocks have been divided into the same bucket in the process of using local sensitive hash, it can effectively reduce the time cost of invalid lookup and improve the processing efficiency. In this paper, the method of code similarity detection based on local sensitive hashing is studied and analyzed. The experimental results show that the efficiency of the proposed method is about 10% higher than that of the original method.
【学位授予单位】:天津工业大学
【学位级别】:硕士
【学位授予年份】:2017
【分类号】:TP311.5
【参考文献】
相关期刊论文 前10条
1 石野;黄龙和;车天阳;高斯;王健;;基于语法树的程序相似度判定方法[J];吉林大学学报(信息科学版);2014年01期
2 汤春蕾;董家麒;;基于LSH的时间子序列查询算法[J];计算机学报;2012年11期
3 张丽萍;刘东升;李彦臣;钟美;;一种基于AST的代码抄袭检测方法[J];计算机应用研究;2011年12期
4 张丽萍;刘东升;李彦臣;;基于语法树的程序代码复制检测方法及其评价机制的研究[J];内蒙古大学学报(自然科学版);2010年05期
5 于冬琦;彭鑫;赵文耘;;使用抽象语法树和静态分析的克隆代码自动重构方法[J];小型微型计算机系统;2009年09期
6 蔡衡;李舟军;孙健;李洋;;基于LSH的中文文本快速检索[J];计算机科学;2009年08期
7 张莉;周祖林;;代码相似性检测在程序设计教学中的应用[J];计算机教育;2009年13期
8 邓爱萍;;程序代码相似度度量算法研究[J];计算机工程与设计;2008年17期
9 赵长海;晏海华;金茂忠;;基于编译优化和反汇编的程序相似性检测方法[J];北京航空航天大学学报;2008年06期
10 鲍军鹏,沈钧毅,刘晓东,宋擒豹;自然语言文档复制检测研究综述[J];软件学报;2003年10期
,本文编号:2389105
本文链接:https://www.wllwen.com/falvlunwen/zhishichanquanfa/2389105.html