Research on an Objectionable Network Information Filtering Algorithm Based on One-Class SVM

Published: 2019-05-10 07:32

[Abstract]: With the rapid growth of the Internet, monitoring and filtering files transmitted over networks has become an active research topic, because such files may contain objectionable information. Content in network traffic is wrapped in various network protocols and may be fragmented and encoded, so a machine cannot directly recognize the content that needs to be monitored. For content filtering, traditional string-matching algorithms clearly cannot meet the regulatory demands of explosively growing volumes of information. Using an SVM can improve classification efficiency, but the feature dimension remains too large, which wastes storage and computing resources. This thesis first analyzes how, across many network protocols, the transmitted content carried by a protocol can be automatically identified and matched based on the protocol's own characteristics and its state machine; the data stream is then reassembled and restored, and decoded where necessary, to obtain the text to be filtered. The thesis focuses on the mainstream application-layer protocols HTTP, FTP, SMTP, and POP3, as well as the mainstream proprietary application protocols Fetion, QQ, and MSN. It then proposes an improved algorithm for effectively reducing the dimensionality of the SVM, using three feature-reduction methods to constrain the dimension of the feature vectors. The improvement speeds up computation, saves storage space, and raises accuracy. Experiments show that, for the same number of selected feature words, feature selection based on document frequency (DF), information gain, and the chi-square test each has its own strengths and weaknesses. With only 500 features selected, the improved algorithm achieves an accuracy above 80% in classifying and filtering objectionable information. With more than 1000 features selected, the DF algorithm's accuracy exceeds 90%.
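To make the three feature-selection criteria named in the abstract concrete (document frequency, information gain, and the third criterion, read here as the chi-square test), the following is a minimal sketch of how each could score a candidate feature word. It is not the thesis's implementation. It assumes documents are already tokenized into word lists and that each document carries a binary label (1 for objectionable, 0 otherwise). All function names are hypothetical, chosen for illustration only.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def _binary_entropy(labels):
    """Entropy (in bits) of a list of 0/1 labels; 0.0 for an empty list."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(docs, labels, term):
    """Reduction in label entropy from knowing whether `term` is present."""
    n = len(docs)
    with_t = [y for tokens, y in zip(docs, labels) if term in tokens]
    without_t = [y for tokens, y in zip(docs, labels) if term not in tokens]
    return (_binary_entropy(labels)
            - len(with_t) / n * _binary_entropy(with_t)
            - len(without_t) / n * _binary_entropy(without_t))


def chi_square(docs, labels, term):
    """Chi-square statistic of the 2x2 table (term present?) x (label)."""
    a = b = c = d = 0
    for tokens, y in zip(docs, labels):
        present = term in tokens
        if present and y:
            a += 1          # term present, positive class
        elif present:
            b += 1          # term present, negative class
        elif y:
            c += 1          # term absent, positive class
        else:
            d += 1          # term absent, negative class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy corpus, invented purely for illustration.
    docs = [["cheap", "pills", "buy"], ["buy", "pills", "now"],
            ["meeting", "agenda", "now"], ["project", "meeting", "notes"]]
    labels = [1, 1, 0, 0]
    vocab = sorted({t for tokens in docs for t in tokens})
    ig = {t: information_gain(docs, labels, t) for t in vocab}
    chi = {t: chi_square(docs, labels, t) for t in vocab}
    print("DF :", top_k(document_frequency(docs), 3))
    print("IG :", top_k(ig, 3))
    print("CHI:", top_k(chi, 3))
```

In a pipeline like the one the abstract describes, only the top-ranked terms (for example 500 or 1000) would be kept as dimensions of the vectors passed to the SVM. Document frequency needs no labels, whereas information gain and chi-square as written here do, which is one practical difference among the three criteria.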
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.08
