Research on Text Classification Methods Based on Feature Selection and Their Application

Published: 2018-03-31 12:57

Topic: text classification  Focus: feature selection  Source: Jiangnan University, master's thesis, 2017

[Abstract]: With the continuing development of computer technology, the volume of online information has grown explosively. This information enriches people's lives, but it also includes a great deal of useless or even harmful content, which makes it difficult to use information rationally and effectively. How to accurately find useful information within large volumes of data remains an open problem in information technology. Text classification offers an effective solution. Traditional manual classification based on expert knowledge consumes substantial labor and time and can no longer keep pace with modern data growth, which has led to automatic text classification methods. Feature selection is an indispensable technique in text classification, and how well it selects features strongly affects the classification result. Because the traditional chi-square statistic feature selection method does not fully account for within-class term frequency or the distribution of feature terms, this thesis proposes a feature selection method that refines the chi-square statistic using intra-class information. Among classifiers, the support vector machine (SVM) is one of the most typical machine learning methods for automatic text classification; it is simple, efficient, and accurate, and it has attracted wide attention from researchers. This thesis uses an SVM for text classification. To further improve accuracy and to address the difficulty of choosing SVM parameters, it proposes optimizing the SVM model with an improved artificial bee colony algorithm, which modifies the search strategies of the employed (leading) bees and onlooker (following) bees of the basic algorithm and effectively improves classification accuracy. To broaden the application of text classification, a secondary bioinformatics database of the human p53 cancer gene is constructed as the classification corpus. The database mainly contains exon and intron sequence information for the p53 gene across multiple cancers and provides a good platform for in-depth cancer research. In addition, a sequence alignment method based on a quasi-alignment cellular neural network is proposed to align and analyze the cancer p53 gene sequences in the database; it effectively improves alignment similarity and provides a theoretical basis for further research on cancer text classification.
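As a point of reference for the feature selection step, the following is a minimal sketch of the standard (unmodified) chi-square statistic that the abstract takes as its baseline. It is not the thesis's improved method, whose formula is not given here. The function names `chi_square` and `chi_square_scores` and the toy corpus are illustrative assumptions. For a term t and class c, with A documents in c containing t, B documents outside c containing t, C documents in c without t, D documents outside c without t, and N = A + B + C + D, the statistic is N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D)).

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. the term occurs in every document).
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def chi_square_scores(docs, labels, target):
    """Score every term against one target class.

    docs is a list of token lists and labels is the parallel list of
    class labels (both hypothetical inputs for illustration).
    Only document presence is counted, not how often a term occurs
    inside a document.
    """
    n_docs = len(docs)
    n_target = sum(1 for y in labels if y == target)
    in_class, out_class = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # presence only, duplicates ignored
            if y == target:
                in_class[term] += 1
            else:
                out_class[term] += 1
    scores = {}
    for term in set(in_class) | set(out_class):
        a = in_class[term]
        b = out_class[term]
        c = n_target - a
        d = (n_docs - n_target) - b
        scores[term] = chi_square(a, b, c, d)
    return scores


# Toy corpus, invented purely for illustration.
docs = [["gene", "cancer"], ["gene", "p53"], ["stock", "market"], ["stock", "gene"]]
labels = ["bio", "bio", "fin", "fin"]
scores = chi_square_scores(docs, labels, target="bio")
```

Because the counts record only whether a term appears in a document, this baseline ignores within-class term frequency and how evenly a term is spread across the documents of a class, which is the limitation the abstract says the proposed method addresses.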
[Degree-granting institution]: Jiangnan University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1
