Research on P2P Traffic Identification Based on Transfer Learning

Published: 2018-07-31 16:31

[Abstract]: With the large-scale growth of Internet applications built on P2P technology and the surge in their user numbers, the network resources that P2P consumes are putting increasing pressure on the construction and maintenance of data transmission networks. How to manage P2P applications so that they can develop healthily within existing network resources is a topic of active interest among researchers in China and abroad. P2P traffic identification is the foundation for managing P2P applications and has been studied continuously. The main existing approaches are port-based detection, content-based scanning, and identification based on traffic characteristics. Each solves the problem to some extent, but each has its own shortcomings. Machine learning, a popular research direction in computer science, is a class of algorithms that automatically derive regularities from data and use them to make predictions on unseen data. Many machine learning algorithms can already identify P2P traffic effectively, but they all depend on large numbers of manually labeled training samples, and those samples are hard to reuse once network conditions change rapidly. Working within transfer learning, a new machine learning framework, this thesis combines it with traditional machine learning algorithms to propose new schemes for P2P traffic identification. These algorithms achieve good identification accuracy with only a small number of manually labeled samples. The main contributions and innovations of the thesis are the following three.

First, the thesis studies the adaptive-boosting-based transfer learning method used in text classification, introduces it into P2P traffic identification, and proposes an improved algorithm that places more emphasis on real-time performance. The thesis combines the method with the characteristics of P2P traffic identification: by adjusting the weights of the auxiliary data, it transfers that data to the source data in a more targeted way, forming a combined training set on which the classifier is trained and finally yielding a reliable P2P identifier. On this basis, the thesis also applies dynamic pruning of the auxiliary data based on the per-iteration error rate, which removes auxiliary data that differ too much from the source data, speeds up the iterations, and reduces time consumption. Simulation experiments show that the improved algorithm is better suited to real-time use and practical application.

Second, the thesis combines the traditional K-nearest-neighbor method with the transfer learning framework to propose a K-nearest-neighbor-based transfer learning method, applies it to P2P traffic identification, and improves the algorithm with respect to complexity. The algorithm uses K-nearest neighbors to filter the auxiliary data, discarding auxiliary data that differ substantially from the source data, so that the auxiliary data most similar to the source data join it in a combined training set used to train a reliable P2P traffic classifier. On this basis, the thesis uses singular value decomposition to pre-group the data, which reduces the computation required by the K-nearest-neighbor step. Simulation experiments confirm the effectiveness of the algorithm and show that the improvement enhances the real-time performance of the algorithm as a whole.

Third, the thesis builds a simple P2P traffic identification system based on Java and the Web to make it easier to test and exchange algorithms and data sets. The system implements the two algorithms above with a Web interface and a Java core, and makes them publicly available. Users can upload their own data sets for identification or download data sets shared by others, which provides an effective platform for exchanging P2P traffic identification algorithms.
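The weight adjustment described in the first contribution can be illustrated with a short sketch. The abstract does not give the update rule, so the code below assumes a TrAdaBoost-style scheme, the commonly cited boosting approach to transfer learning in text classification. It also assumes that "source data" refers to the small labeled set from the current network. The function name `reweight_step` and its parameters are hypothetical, and the error-rate-based pruning is not shown because its details are not stated.

```python
import math


def reweight_step(aux_w, src_w, aux_wrong, src_wrong, n_rounds):
    """One boosting-style reweighting round (hypothetical helper).

    aux_w, src_w: current weights of auxiliary and source samples.
    aux_wrong, src_wrong: booleans, True where the current weak
        classifier misclassified the sample.
    n_rounds: total number of boosting rounds planned (must be > 0).
    Returns (new_aux_w, new_src_w, eps) with weights normalized to sum to 1.
    """
    # Weighted error is measured on the source data only.
    eps = sum(w for w, bad in zip(src_w, src_wrong) if bad) / sum(src_w)
    # Clamp so that beta_t stays finite and below 1 (assumed safeguard).
    eps = min(max(eps, 1e-10), 0.499)
    beta_t = eps / (1.0 - eps)

    # Fixed shrink factor for auxiliary samples, as in TrAdaBoost.
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(len(aux_w)) / n_rounds))

    # Misclassified auxiliary samples lose weight: they look unlike the source.
    new_aux = [w * beta if bad else w for w, bad in zip(aux_w, aux_wrong)]
    # Misclassified source samples gain weight, as in ordinary boosting.
    new_src = [w / beta_t if bad else w for w, bad in zip(src_w, src_wrong)]

    total = sum(new_aux) + sum(new_src)
    return [w / total for w in new_aux], [w / total for w in new_src], eps
```

Repeating this step over several rounds drives down the influence of auxiliary flows that the classifier keeps getting wrong, which is one plausible reading of "more targeted" transfer.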
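The K-nearest-neighbor filtering in the second contribution can be sketched in a similar way. The abstract does not state the filtering criterion, so this example assumes one simple rule: an auxiliary sample is kept only if its label agrees with the majority label of its k nearest source samples. The name `knn_filter`, the Euclidean distance, and the default `k=3` are illustrative assumptions, and the singular value decomposition pre-grouping is omitted.

```python
import math
from collections import Counter


def knn_filter(aux, src, k=3):
    """Filter auxiliary samples by label agreement with source neighbors.

    aux, src: lists of (feature_vector, label) pairs, where feature_vector
        is a sequence of floats of equal length.
    Returns the auxiliary samples that pass the (assumed) agreement rule.
    """
    kept = []
    for features, label in aux:
        # The k source samples closest to this auxiliary sample.
        nearest = sorted(src, key=lambda s: math.dist(features, s[0]))[:k]
        # Majority label among those neighbors.
        majority = Counter(lbl for _, lbl in nearest).most_common(1)[0][0]
        if majority == label:
            kept.append((features, label))
    return kept
```

The filtered auxiliary samples would then be merged with the source samples to form the combined training set. Sorting all source samples for every auxiliary sample is the costly part of this brute-force version, which is the kind of computation a pre-grouping step aims to reduce.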
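[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP18; TP393.02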

[References]