Title Information Classification Based on HowNet Semantic Feature Expansion

Published: 2018-05-16 10:04

  Topic: journal paper titles + short text classification; Reference: Library Journal (《图书馆杂志》), 2017, Issue 02


[Abstract]: This paper exploits the semantic relatedness within a text collection. It derives a semantic core word set for the training set at two different granularities: high-frequency words and latent topics. It then uses HowNet as an external resource to compute the similarity between the semantic core word set and the feature words in the test set, and expands the training-set feature words whose similarity exceeds a given threshold into the texts to be classified, whose only content is the title. Finally, an SVM is used for classification. Experimental results show that when both the training set and the test set consist only of titles, the improvement is greatest, at 3.1%, when the training set has 200 documents per class, but the improvement declines as the number of training texts increases. When the training set consists of titles plus abstracts and the test set of titles only, the proposed classification algorithm raises Macro_F1 on the Fudan corpus and on a self-built journal corpus by an average of 1.5% and 3.1%, respectively, and Micro_F1 by an average of 2.3% and 5.3%, respectively. By expanding the features of feature-sparse title information, the paper aims to improve the classification of journal paper titles.
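The threshold-based expansion step described in the abstract can be illustrated with a minimal sketch. This is one possible reading of the abstract, not the authors' implementation: the function names, the toy character-overlap similarity (a stand-in for a real HowNet word similarity), and the threshold value are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def char_overlap_similarity(a: str, b: str) -> float:
    """Toy stand-in for a HowNet similarity (assumption, not HowNet itself).

    Returns the Jaccard overlap of the two words' character sets, in [0, 1].
    """
    chars_a, chars_b = set(a), set(b)
    if not chars_a or not chars_b:
        return 0.0
    return len(chars_a & chars_b) / len(chars_a | chars_b)


def expand_title_features(
    title_words: Sequence[str],
    core_words: Sequence[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Append core words that are similar enough to some word in the title.

    A core word is added when its best similarity to any title word is
    strictly greater than `threshold`. Original title words are kept first,
    and no word is added twice.
    """
    expanded = list(title_words)
    seen = set(title_words)
    for core in core_words:
        if core in seen:
            continue
        best = max((similarity(core, word) for word in title_words), default=0.0)
        if best > threshold:
            expanded.append(core)
            seen.add(core)
    return expanded


if __name__ == "__main__":
    # Hypothetical title tokens and semantic core words from a training set.
    title = ["text", "classification"]
    core = ["texts", "classifier", "library"]
    print(expand_title_features(title, core, char_overlap_similarity, 0.5))
    # → ['text', 'classification', 'texts', 'classifier']
```

In a full pipeline the expanded word lists would then be turned into feature vectors and passed to an SVM classifier, and the similarity function would be replaced by one backed by HowNet; both are left out here to keep the sketch self-contained.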
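[Author affiliation]: School of Information Management, Wuhan University; Center for Studies of Information Resources, Wuhan University
[Funding]: One of the research outputs of the Social Science Fund project "Research on Automatic Classification of Multiple Types of Digital Text Resources" (Project No. 15BTQ066)
[Classification number]: TP391.1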


Article ID: 1896422


Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1896422.html

