Research on Several Problems in Text Mining of Web Information

Published: 2018-04-12 06:34

Topic: text mining + feature clustering; Source: Beijing Institute of Technology (北京理工大学), 2015 doctoral dissertation

[Abstract]: Faced with text data of enormous scale and extremely high dimensionality, designing sound and easily extensible text mining algorithms has become an active research direction in data mining. This dissertation studies several problems in text mining in depth. Its main contributions fall into five areas:

1. The traditional vector space model has excessively high dimensionality and cannot handle synonyms or near-synonyms. To address this, the dissertation proposes a vector space model based on feature clustering. The model first represents each feature as a vector, then clusters these features and treats each resulting cluster as a single feature. It also recognizes discontinuous phrases that form proper nouns, so that the feature information in the text representation vector becomes richer and more precise. The method effectively reduces the dimensionality of text vectors and better captures the semantic relations among text features, and therefore improves the quality of text mining. Experiments show that the resulting text representation vectors achieve a high feature reduction rate and that the clustering F-measure improves markedly over traditional methods.

2. The traditional K-means algorithm selects its initial centers at random, which tends to make the results fluctuate. To address this, the dissertation proposes a K-means algorithm based on a similarity matrix. Instead of selecting initial cluster centers at random, the method uses the similarity matrix to choose more effective initial cluster centers in a targeted way. This gives the clustering process a good start and reduces the instability that the initial centers introduce into the final result, leading to better clustering quality. Experiments show that the improved algorithm markedly raises the clustering F-measure and yields relatively stable results.

3. Text mining applications often lack sufficient labeled data. To address this, the dissertation proposes a semi-supervised K-means algorithm that uses labeled and unlabeled data together, exploiting the characteristics of the labeled data to help label the unlabeled data. When choosing initial points, the method takes some of them from the class centroids of the labeled data and the rest from unlabeled points that lie far from the labeled data already selected. This ensures that the initial points belong to different clusters and yields more accurate results. Experiments show that the algorithm is effective and, to some extent, alleviates the shortage of labeled data.

4. Imbalanced training corpora are common and degrade classification quality. To address this, the dissertation proposes a hybrid weighted KNN algorithm. The method analyzes the distribution of the training samples and applies inverse-proportion weighting, so that every training sample is equally likely to fall within the neighborhood of the sample to be classified and the result is no longer affected by imbalanced class distribution. It also applies distance weighting, so that the closer a training sample is to the sample to be classified, the larger its weight. Experiments show that the algorithm achieves good classification accuracy and is an effective method for classification with imbalanced training corpora.

5. To improve computational efficiency and make large datasets easier to handle, the text clustering and text classification algorithms proposed in the dissertation are parallelized with MapReduce and integrated as modules into a complete text mining system that automates the entire text mining workflow. Experiments show that parallelizing the improved algorithms does not reduce the accuracy of text mining and greatly improves running efficiency.
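The hybrid weighting described in contribution 4 can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so the sketch assumes that a neighbor's vote is the product of an inverse class-proportion weight (total samples divided by the size of the neighbor's class) and an inverse-distance weight; the function name `hybrid_weighted_knn` and the parameters `k` and `eps` are hypothetical and not taken from the dissertation.

```python
import math
from collections import Counter, defaultdict


def hybrid_weighted_knn(train_points, train_labels, query, k=5, eps=1e-9):
    """Classify `query` with a k-nearest-neighbor vote using two combined weights.

    Assumed weighting (not confirmed by the source):
      vote = (n_total / n_in_class) * 1 / (distance + eps)
    """
    n_total = len(train_points)
    class_counts = Counter(train_labels)

    # Inverse-proportion weight: samples from rare classes count for more.
    class_weight = {
        label: n_total / count for label, count in class_counts.items()
    }

    # Euclidean distance from the query to every training sample.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    neighbors = sorted(distances, key=lambda pair: pair[0])[:k]

    # Accumulate weighted votes; closer neighbors contribute more.
    scores = defaultdict(float)
    for distance, label in neighbors:
        scores[label] += class_weight[label] / (distance + eps)
    return max(scores, key=scores.get)


# Imbalanced toy data: two samples of class "B", eight of class "A".
points = [(0.0, 0.0), (0.2, 0.0)] + [(1.0 + 0.1 * i, 0.0) for i in range(8)]
labels = ["B", "B"] + ["A"] * 8

print(hybrid_weighted_knn(points, labels, (0.5, 0.0), k=5))
```

In this toy data the five nearest neighbors of the query are three "A" samples and two "B" samples, so an unweighted majority vote would favor the larger class. Under the assumed weighting, the two nearby minority-class samples outweigh them.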
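[Degree-granting institution]: Beijing Institute of Technology (北京理工大学)
[Degree level]: Doctoral
[Year of degree conferral]: 2015
[Classification number]: TP391.1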
