An Improved Method for Computing Term Weights in Documents

Published: 2018-11-04 15:38

[Abstract]: The formal representation of text is a fundamental problem in information retrieval areas such as text retrieval, automatic summarization, and search engines. The tf.idf text representation in the Vector Space Model is widely used in these areas and performs well. How a term's occurrences are distributed, in proportion, across a text collection is one of the important factors that determine how well the term expresses text content, but the existing tf.idf method cannot capture this factor. To address this problem, the paper introduces the concept of information gain from information theory and proposes tf.idf.IG, an improved text representation based on tf.idf. The method uses a term's information gain as a factor in the text representation, measuring the quantitative differences in the term's distribution proportions across the text collection. In text classification experiments, the vector space model with the tf.idf.IG representation classified better than the one with tf.idf, which supports the effectiveness and feasibility of the improved tf.idf.IG method.
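The abstract does not give the exact tf.idf.IG formula, so the following is only a minimal sketch of the idea under stated assumptions: the three factors are combined by multiplication, idf is the natural logarithm of N/df, and information gain is computed from a term's presence or absence with respect to class labels. The function names (`entropy`, `information_gain`, `tf_idf_ig`) are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(term, docs, labels):
    """IG of a term's presence/absence with respect to the class labels.

    Assumption: this presence/absence form of IG is a common choice in
    text classification; the paper's exact definition may differ.
    """
    n = len(docs)
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    h_class = entropy(list(Counter(labels).values()))
    h_cond = sum(
        (len(part) / n) * entropy(list(Counter(part).values()))
        for part in (with_term, without_term)
        if part
    )
    return h_class - h_cond


def tf_idf_ig(docs, labels):
    """Return one {term: weight} dict per document.

    Assumption: weight = tf * idf * IG, with idf = ln(N / df).
    Each document is a list of tokens.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    ig = {term: information_gain(term, docs, labels) for term in df}
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) * ig[t] for t in tf})
    return weights
```

In a toy collection where "ball" occurs only in one class and "win" occurs equally often in both, the two terms have the same document frequency and so the same idf, but the IG factor keeps the weight of "ball" and drives the weight of "win" to zero. This is the kind of distributional difference the abstract says plain tf.idf misses.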
[Affiliation]: Software Research Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080
[Funding]: 973 Program (G1998030510); National Natural Science Foundation of China (69773008); National 863 Program (863-306-2D02-01-3)
[CLC Number]: TP391
