A Web Page Classification Method Based on the MIMLRBF Neural Network

Published: 2018-08-04 20:14

【Abstract】: With the development and spread of technology and the Internet, a massive amount of information is published online. To help people extract useful information from the vast number of web pages, automatic web page classification emerged; it is a machine-learning-based method for automatically labeling the categories of web pages. Within this area, RBF neural networks under the multi-instance multi-label (MIML) learning framework show strong learning and classification ability and have become a research focus in machine learning. This thesis introduces the history, principles, and related techniques of RBF neural networks, analyzes commonly used RBF training and classification algorithms, studies the new MIML learning framework and its algorithms, and focuses on the MIMLRBF neural network algorithm, which was proposed to solve MIML problems with RBF neural networks. On imbalanced sample sets, MIMLRBF produces an unbalanced number of hidden-layer neurons across classes, and training neglects the classes with few samples, which degrades classification. To address this, the thesis proposes an improved algorithm. First, the class with the fewest samples is identified, and based on its sample count a certain number of mutually distant initial cluster centers are selected for each class. Next, for the remaining samples in each class, the size of that class determines whether a sample may serve as a new cluster center. Finally, the center objects are optimized with a relevant algorithm. Each cluster center corresponds to one hidden-layer neuron, so the number of hidden-layer neurons is determined dynamically from the number of samples in each class. This brings the neuron counts closer to balance and reduces the effect of class imbalance on network performance.
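The first step of the imbalance-aware center selection can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's actual procedure: the abstract does not give the rule for the number of centers, so `centers_per_class` and its `ratio` parameter are hypothetical, the greedy farthest-first traversal is one common way to obtain mutually distant centers, and the Hausdorff distance between bags is assumed as the bag-level metric.

```python
import numpy as np

def hausdorff(a, b):
    """Maximal Hausdorff distance between two bags (2-D arrays of instances)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def centers_per_class(class_sizes, ratio=0.5):
    """Hypothetical rule: derive one initial center count from the smallest class."""
    k = max(1, int(ratio * min(class_sizes.values())))
    return {c: min(k, n) for c, n in class_sizes.items()}

def farthest_first_centers(bags, k):
    """Greedily pick up to k mutually distant bags; returns their indices."""
    k = min(k, len(bags))
    chosen = [0]  # assumption: start from the first bag of the class
    # Distance from every bag to its nearest already-chosen center.
    nearest = np.array([hausdorff(b, bags[0]) for b in bags])
    while len(chosen) < k and nearest.max() > 0:
        nxt = int(np.argmax(nearest))
        chosen.append(nxt)
        to_new = np.array([hausdorff(b, bags[nxt]) for b in bags])
        nearest = np.minimum(nearest, to_new)
    return chosen

bags = [np.array([[0.0, 0.0]]), np.array([[0.0, 1.0]]), np.array([[10.0, 10.0]])]
picked = farthest_first_centers(bags, 2)
counts = centers_per_class({"a": 10, "b": 4})
```

Deriving the initial count from the smallest class gives every class the same starting number of hidden neurons; the later steps described in the abstract (promoting remaining samples to new centers and optimizing the centers) are not covered by this sketch.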
The classical MIMLRBF neural network algorithm assigns the same width parameter to every radial basis function and does not consider how densely the samples are distributed around each center. To address this, the thesis proposes an improved algorithm that takes the sample distribution within each cluster into account. First, a relevant algorithm is used to find the center points of each class, and the mean and variance of the distances between center points are computed; the distribution of the centers reflects the distribution of the whole sample set. Then the variance of the samples in each cluster is computed, which reflects the distribution of samples within that cluster. Finally, an appropriate width is determined for each radial basis function from both the cluster-level and the overall distributions, which makes the whole network smoother. The proposed algorithm is compared with three classical algorithms on two commonly used data sets, and the improved algorithm is applied in a web page classification system. The experimental data show that the improved algorithm achieves better classification efficiency and effectiveness.
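A per-center width of the kind described above can be sketched as follows. The abstract does not state the actual formula, so the combination used here is an assumption for illustration only: the global scale is the mean pairwise distance between centers, and each center's width is that scale multiplied by its cluster's spread relative to the average spread. The function name `rbf_widths`, the scaling factor `mu`, and the use of plain vectors with Euclidean distance (instead of bags) are all hypothetical simplifications; the variance of the center distances mentioned in the abstract is omitted for brevity.

```python
import numpy as np

def rbf_widths(centers, clusters, mu=1.0, eps=1e-12):
    """Return one width per center from global and within-cluster spread.

    centers:  array of shape (k, d), with k >= 2
    clusters: list of k arrays, clusters[j] holding the samples of center j
    """
    centers = np.asarray(centers, dtype=float)
    # Global spread: mean pairwise Euclidean distance between centers.
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    upper = np.triu_indices(len(centers), k=1)
    mean_center_dist = dist[upper].mean()
    # Local spread: root mean squared distance of samples to their center.
    spread = np.array([
        np.sqrt(np.mean(np.sum((np.asarray(x, dtype=float) - c) ** 2, axis=1)))
        for x, c in zip(clusters, centers)
    ])
    relative = spread / (spread.mean() + eps)
    return mu * mean_center_dist * relative

centers = [[0.0, 0.0], [4.0, 0.0]]
clusters = [
    [[1.0, 0.0], [-1.0, 0.0]],   # tight cluster around the first center
    [[4.0, 3.0], [4.0, -3.0]],   # looser cluster around the second center
]
widths = rbf_widths(centers, clusters)
```

With this rule a tight cluster receives a narrower basis function and a loose cluster a wider one, while the average width stays tied to the spacing between centers.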
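【Degree-granting institution】: China University of Petroleum (East China)
【Degree level】: Master's
【Year degree granted】: 2014
【Classification number】: TP393.092;TP183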

【Similar literature】