Manifold Learning and Its Application in Text Classification
[Abstract]: As computing power and storage capacity grow, large-scale data acquisition has become more convenient and widespread, but it also brings new problems. In many fields, such as text mining, biometric authentication, image analysis, computer vision, text analysis in information retrieval, and computational biology, the data obtained are high-dimensional, which can lead to the "curse of dimensionality". In recent years, manifold learning has become an active research area in machine learning. It aims to uncover the regularities and structure hidden in a high-dimensional data space, and it is widely used as a nonlinear dimensionality reduction method. Text classification is the technical foundation of information retrieval, search engines, text databases, digital libraries and similar systems, and therefore has broad application prospects. Because text data are unstructured, feature vectors reach tens of thousands or even hundreds of thousands of dimensions. Such high dimensionality introduces a large amount of redundant feature information, which lowers classification accuracy. Dimensionality reduction shrinks the text vectors so that the remaining features better represent the text or its category. This thesis assumes that a latent text manifold exists in the text vector space and treats each text as a sample point on that manifold. It applies manifold learning to the preprocessing stage of text classification and proposes a Bagging text classification algorithm based on ISOMAP, describing the relevant theory and the concrete flow of the algorithm. It then extends ISOMAP to an incremental version, proposes a Bagging text classification algorithm based on incremental manifold learning, and presents an experimental comparison and analysis. Experimental results show that manifold learning can effectively improve the performance of text classification.
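The pipeline described in the abstract (vectorize texts, reduce dimensionality with ISOMAP, then classify with Bagging) can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the helper names tf_vectors, isomap, nearest_neighbour and bagging_predict, the toy corpus, the raw term-frequency features, and the choice of a 1-nearest-neighbour base learner. The sketch also embeds training and test texts together in one batch, so it does not reproduce the incremental ISOMAP variant the thesis proposes.

```python
import math
import random
from collections import Counter

def tf_vectors(docs):
    """Turn whitespace-tokenised documents into term-frequency vectors."""
    vocab = sorted({w for d in docs for w in d.split()})
    return [[d.split().count(w) for w in vocab] for d in docs]

def isomap(points, n_neighbors, n_components, iters=300, seed=0):
    """Basic ISOMAP: kNN graph -> shortest paths -> classical MDS."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Neighbourhood graph: keep only edges to the nearest neighbours (symmetrised).
    geo = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in others[:n_neighbors]:
            geo[i][j] = geo[j][i] = dist[i][j]
    # Floyd-Warshall shortest paths approximate geodesic distances on the manifold.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                geo[i][j] = min(geo[i][j], geo[i][k] + geo[k][j])
    if any(math.isinf(g) for row in geo for g in row):
        raise ValueError("neighbourhood graph is disconnected; raise n_neighbors")
    # Classical MDS: double-centre the squared geodesic distances.
    sq = [[g * g for g in row] for row in geo]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    gram = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
             for j in range(n)] for i in range(n)]
    # Leading eigenpairs by power iteration with deflation.
    rng = random.Random(seed)
    coords = [[] for _ in range(n)]
    for _ in range(n_components):
        v = [rng.uniform(-1, 1) for _ in range(n)]
        for _ in range(iters):
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(gram[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(lam, 0.0))  # clamp: geodesic Gram matrix may not be PSD
        for i in range(n):
            coords[i].append(scale * v[i])
        gram = [[gram[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

def nearest_neighbour(train_x, train_y, x):
    """Base learner (an assumption): label of the closest training point."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[best]

def bagging_predict(train_x, train_y, test_x, n_estimators=11, seed=0):
    """Bagging: one base learner per bootstrap sample, combined by majority vote."""
    rng = random.Random(seed)
    n = len(train_x)
    bags = [[rng.randrange(n) for _ in range(n)] for _ in range(n_estimators)]
    predictions = []
    for x in test_x:
        votes = Counter(
            nearest_neighbour([train_x[i] for i in bag], [train_y[i] for i in bag], x)
            for bag in bags)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Toy corpus (hypothetical data, for illustration only).
train_docs = ["match goal team win", "team score goal season",
              "player match season win", "software code compiler bug",
              "code server network bug", "compiler software network server"]
train_labels = ["sports", "sports", "sports", "tech", "tech", "tech"]
test_docs = ["team goal player score", "server code software bug"]

vectors = tf_vectors(train_docs + test_docs)
embedded = isomap(vectors, n_neighbors=4, n_components=2)
train_x, test_x = embedded[:len(train_docs)], embedded[len(train_docs):]
predicted = bagging_predict(train_x, train_labels, test_x)
print(predicted)
```

In this sketch n_neighbors=4 is chosen so that the eight-point neighbourhood graph is guaranteed to be connected. On a real corpus the neighbourhood size, the target dimension and the base learner would all need tuning, and the cubic-time shortest-path step would be replaced by a faster method.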
[Degree-granting institution]: Hefei University of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1