
Text Feature Evaluation and Selection Algorithm Based on Multi-Index Fusion

Published: 2019-03-13 09:20
Abstract: In text classification, features are judged by several criteria, chiefly the relevance of a feature to the class, the redundancy of the feature, and how sparsely it occurs in the corpus. Because feature quality directly affects classification performance, all of these factors need to be considered together. Feature selection is commonly split into three steps that measure relevance, redundancy, and sparsity in turn, and the weighting and filtering in each step take considerable time, so this step-by-step approach is hard to apply when both real-time response and accuracy are required. To address this, a coordinate model is first built that maps relevance, redundancy, and sparsity into a coordinate system. Each feature is then weighted and filtered by the angle between the vector from the origin to its point and a coordinate plane or coordinate axis. This merges the separate criteria into a single one, greatly reduces the time spent on repeated weighting and filtering, and makes feature selection more efficient. Experiments on the Fudan University Chinese text corpus and the NetEase text corpus show that, compared with the step-by-step method, the multi-index fusion algorithm selects word and n-gram features faster and more accurately. The effectiveness of the selected features for classification was verified with a support vector machine (SVM).
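The angle-based scoring described in the abstract can be illustrated with a minimal sketch. The abstract does not say which axis or plane is used, how the three indices are computed, or how they are normalized, so everything below is an assumption made for illustration: each feature is given as a hypothetical triple (relevance, redundancy, sparsity) already scaled to [0, 1], the angle is measured against the relevance axis, and a smaller angle is treated as better. The function names are likewise hypothetical and are not taken from the paper.

```python
import math


def angle_to_relevance_axis(relevance, redundancy, sparsity):
    """Angle in radians between the vector (relevance, redundancy, sparsity)
    and the relevance axis. Assumes all three scores are non-negative."""
    off_axis = math.hypot(redundancy, sparsity)  # length of the off-axis part
    if relevance == 0.0 and off_axis == 0.0:
        # Assumption: a feature with no signal at all is ranked last.
        return math.pi / 2
    return math.atan2(off_axis, relevance)


def select_features(scores, k):
    """Return the k feature names with the smallest angle.

    scores maps a feature name to its (relevance, redundancy, sparsity) triple.
    """
    ranked = sorted(scores, key=lambda name: angle_to_relevance_axis(*scores[name]))
    return ranked[:k]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    example = {
        "a": (0.9, 0.1, 0.1),
        "b": (0.2, 0.8, 0.5),
        "c": (0.5, 0.5, 0.5),
    }
    print(select_features(example, 2))
```

A single sort over one fused score replaces three separate weighting and filtering passes, which is the efficiency argument the abstract makes. Whether the paper ranks by this particular angle is not stated in the abstract.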
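Author affiliation: School of Software, Liaoning Technical University
Funding: National Natural Science Foundation of China (No. 70971059); Liaoning Province Innovation Team Project (No. 2009T045); Liaoning Province Program for the Growth of Outstanding Young Scholars in Higher Education Institutions (No. LJQ2012027)
Classification number: TP391.1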

