Manifold Learning and Its Application in Text Classification

Published: 2018-12-11 23:06

[Abstract]: As computing power and storage capacity grow, large-scale data acquisition has become easier and more common, but it also raises new problems. In many fields, such as text mining, biometric authentication, image analysis and computer vision, text analysis in information retrieval, and computational biology, the data obtained are high-dimensional, which can easily lead to the "curse of dimensionality". In recent years, manifold learning has become an active research direction in machine learning. It seeks the regularities and structure hidden in high-dimensional data spaces and is widely used as a nonlinear dimensionality reduction method for high-dimensional data. Text classification is a foundational technology for information retrieval, search engines, text databases, digital libraries, and related fields, and has broad application prospects. Because text data are unstructured, the feature vectors used to represent text can reach tens of thousands or even hundreds of thousands of dimensions. Such high dimensionality greatly increases redundant feature information, which lowers classification accuracy. Dimensionality reduction decreases the dimension of the text vectors so that the feature vectors better represent the text or category characteristics. This thesis assumes that a latent text manifold exists in the text vector space and treats each text as a point sampled from that manifold. It applies manifold learning to the preprocessing stage of text classification and proposes an ISOMAP-based Bagging text classification algorithm, describing the underlying theory and the concrete steps of the algorithm fairly completely. It then makes an incremental improvement to the ISOMAP algorithm and proposes a Bagging text classification algorithm based on incremental manifold learning. Experimental comparison and analysis show that applying manifold learning to text classification can effectively improve classification performance.
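The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea behind ISOMAP, not the thesis's own method: distances along the manifold are approximated by shortest paths in a k-nearest-neighbor graph. The function names, the semicircle test data, and the choice `k=2` are illustrative assumptions; the final embedding step (classical MDS) and the Bagging classifier are omitted.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def geodesic_distances(points, k):
    """Approximate manifold distances: k-NN graph plus all-pairs shortest paths."""
    n = len(points)
    inf = float("inf")
    dist = [[inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        # Connect point i to its k nearest neighbors (Euclidean).
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: euclidean(points[i], points[j]),
        )[:k]
        for j in neighbors:
            d = euclidean(points[i], points[j])
            dist[i][j] = min(dist[i][j], d)
            dist[j][i] = min(dist[j][i], d)  # keep the graph symmetric
    # Floyd-Warshall: shortest path length between every pair of points.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist

# Illustrative data (an assumption): 21 points sampled along a unit semicircle.
points = [(math.cos(math.pi * t / 20), math.sin(math.pi * t / 20)) for t in range(21)]
geo = geodesic_distances(points, k=2)
straight = euclidean(points[0], points[-1])
print("graph distance between endpoints:", geo[0][-1])
print("straight-line distance between endpoints:", straight)
```

On this toy curve the graph distance between the two endpoints comes out close to the arc length π, while the straight-line distance is 2. That gap is what a nonlinear method exploits and a linear projection ignores.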
[Degree-granting institution]: Hefei University of Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.1

[References]