
Research on Automatic Classification of Multiple Document Types Based on Wikipedia

Published: 2018-07-08 17:44

  Topics: digital library + text classification; Source: Wuhan University, 2017 master's thesis


[Abstract]: As the Internet has become widespread, online text resources have grown extremely quickly. Traditional manual classification is too inefficient to organize these digital resources in a timely and effective way, so automatic text classification is needed. Current automatic text classification research, however, mostly targets text drawn from a single document type. A digital library, as a new kind of library, must organize documents that come from books, journals, web pages, and other sources, spanning many fields and many document types. Research on automatic classification across multiple document types is still incomplete, so improving classification algorithms for this setting can significantly advance the growth and development of digital libraries. After introducing and analyzing related research and the main techniques in text classification, this thesis proposes a classification method that uses Wikipedia-based feature expansion to improve classification across different document types. To address the feature mismatch caused by differing document types, the thesis argues that a third-party corpus can link feature words that would otherwise not match, which makes it possible to compute semantic relatedness between texts of different types even when their feature words do not overlap. This enriches the semantic features of the text to be classified so that they match the features of a classifier trained on a different document type, and it also mitigates the feature sparsity that is common in text classification.

The main research content is as follows:

(1) Against the background of explosive growth in online text, the thesis discusses the difficulty that future digital libraries face in classifying and managing web text that is growing geometrically, which motivates research on automatic classification of multiple document types. It then proposes feature expansion as a way to address this problem and argues for the feasibility and applicability of the proposed method by reviewing the results and progress of related research.

(2) The thesis proposes a text classification method for multiple document types based on feature expansion, in which feature expansion is the core step for eliminating semantic differences between texts of different types. Before expansion, a subset of feature words must be extracted from the training texts as the candidate word set for expansion. The thesis discusses the shortcomings of traditional feature selection methods with examples, proposes the principle and method for improving them, and shows by calculation that the new feature selection method resolves those shortcomings. Finally, it uses the improved feature selection method to extract the candidate word set and demonstrates the method's effectiveness through comparative experiments.

(3) To resolve feature mismatch when classifying across document types, the thesis proposes a feature-expansion-based classification method that uses semantic relatedness computed from Wikipedia to measure how strongly feature words are related. After the text to be classified has been expanded, the LDA topic model is used to represent the data. Because the traditional LDA model cannot properly handle weighted feature words, the thesis also improves it, proposing a weighted LDA model that can be fitted and solved for weighted feature words in the same way. Since feature words carry different weights, the precision and accuracy of the LDA model itself also improve.
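The feature expansion step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no formulas, so the weighting rule (original weight times relatedness, keeping the best match), the `threshold` parameter, and the function names are all assumptions. The hand-made `TOY_RELATEDNESS` table merely stands in for semantic relatedness scores that would be computed from Wikipedia.

```python
from typing import Callable, Dict, Iterable

# Hypothetical relatedness scores; a real system would derive these from Wikipedia.
TOY_RELATEDNESS = {
    frozenset(("classifier", "classification")): 0.9,
    frozenset(("library", "catalogue")): 0.7,
    frozenset(("library", "football")): 0.1,
}


def toy_relatedness(a: str, b: str) -> float:
    """Symmetric lookup in the toy table; unknown pairs score 0."""
    if a == b:
        return 1.0
    return TOY_RELATEDNESS.get(frozenset((a, b)), 0.0)


def expand_features(
    doc_weights: Dict[str, float],
    candidates: Iterable[str],
    relatedness: Callable[[str, str], float],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Add candidate words that are sufficiently related to the document's words.

    Assumed rule: a candidate's weight is the largest value of
    (document word weight * relatedness) over document words whose
    relatedness to the candidate reaches the threshold.
    """
    expanded = dict(doc_weights)
    for cand in candidates:
        if cand in expanded:
            continue  # already a feature of the document
        score = 0.0
        for term, weight in doc_weights.items():
            r = relatedness(term, cand)
            if r >= threshold:
                score = max(score, weight * r)
        if score > 0.0:
            expanded[cand] = score
    return expanded


# Document to classify, with feature weights, and a candidate word set
# taken (hypothetically) from training texts of another document type.
doc = {"library": 1.0, "classifier": 0.5}
candidates = ["catalogue", "classification", "football"]
expanded = expand_features(doc, candidates, toy_relatedness, threshold=0.5)
```

In this toy run, "catalogue" and "classification" are added with reduced weights, while "football" is left out because its relatedness falls below the threshold. Weighted features of this kind are the sort of input that a weighted topic model, such as the weighted LDA variant mentioned in the abstract, would need to accept.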
[Degree-granting institution]: Wuhan University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1

