基于局部敏感性哈希的代码相似性检测技术研究

发布时间：2018-12-21 15:37

【摘要】：随着社会的进步,软件系统被广泛运用在日常生活的各个方面,软件系统的代码量不断提升。各类软件系统都有可能面临系统重构和知识产权保护的需求,这就需要对于软件系统源代码进行相似性检测。但是,近些年提出的检测技术对检测效率的提升还有进一步提高的空间。本文通过将局部敏感性哈希应用在高维源代码关键信息矩阵中的近邻搜索,从而快速的得到相似对查询结果。源代码会首先被进行预处理,将源代码转换成q-gram的标识序列。代码段的q-gram标识序列集合通过基于Jplag的相似代码块检测技术进行分类并利用局部敏感性哈希算法将分类结果放置到不同的桶中,在桶中的数据会被重新组织为前缀树形式的数据结构。为了定位不同代码集合中的相似代码对,需要对查询代码段进行相同的哈希运算,可以得到相似概率大于给定阈值的代码块前缀树,利用前缀树的搜索算法对相似代码对进行准确的定位。由于在使用局部敏感性哈希进行分类的过程中就已经将相似的代码块分到同一个桶中,所以可以有效的降低无效查找的时间成本,提高处理效率。本文对基于局部敏感性哈希的代码相似性检测方法进行了研究与分析,提出了使用局部敏感性哈希进行相似代码对快速检索的方法。实验证明了相比原有方法本文提出的方法处理效率提高了约10%。
[Abstract]:With the development of society, software system is widely used in every aspect of daily life. All kinds of software systems are likely to face the need of system reconstruction and intellectual property protection, which requires the similarity detection of the source code of the software system. However, the detection technology proposed in recent years has further improved the detection efficiency. In this paper, the local sensitive hash is applied to the nearest neighbor search in the key information matrix of high dimensional source code, and the similar pair query results are obtained quickly. The source code is preprocessed to convert the source code into an identity sequence of q-gram. The q-gram identification sequence set of the code segment is classified by the similar code block detection technique based on Jplag, and the classification results are placed in different buckets by using the local sensitive hash algorithm. The data in the bucket is reorganized into a data structure in the form of a prefix tree. In order to locate the similar code pairs in different code sets, we need the same hash operation on the query code segment, and we can get the code block prefix tree whose similarity probability is greater than a given threshold. The prefix tree search algorithm is used to locate the similar code pairs accurately. Because similar code blocks have been divided into the same bucket in the process of using local sensitive hash, it can effectively reduce the time cost of invalid lookup and improve the processing efficiency. In this paper, the method of code similarity detection based on local sensitive hashing is studied and analyzed. The experimental results show that the efficiency of the proposed method is about 10% higher than that of the original method.
【学位授予单位】：天津工业大学
【学位级别】：硕士
【学位授予年份】：2017
【分类号】：TP311.5

【参考文献】