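Research on Text Classification and Evaluation Object Extraction Based on Title and Body
Published: 2018-04-01 13:37
Topic: topic model  Focus: text classification  Source: Anhui University, master's thesis, 2017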
[Abstract]: Internet content is growing explosively. An examination of text submitted by Internet users shows that on most websites, particularly news and government sites, the text is structured and usually consists of a title and a body. The body typically describes an event in detail: it is semantically rich, but it also covers multiple topics and is very noisy. The title is typically a concise summary of the event, with accurate information and clear semantics, so making full use of the title is worthwhile. Exploiting these properties, this thesis proposes a topic model based on both title and body and applies it to text classification. Because titles are short and syntactically simple, the thesis also uses rules and syntactic dependency relations to extract the evaluation objects in titles. The main work is as follows:
(1) Using the fact that a document has both a title and a body, the thesis proposes a topic model based on title and body. The model obtains the topic distribution of the body and the topic distribution of the title, and uses an adjustment parameter to optimize the topic distribution of the whole document. Because titles are concise and topically focused, this reduces the effect that the body's semantic complexity and topic diversity have on text classification, yields a better topic distribution for the whole document, and thereby improves classification accuracy.
(2) Because titles are concise and topically focused, syntactic dependency relations are used to obtain the evaluation objects in titles. Potential evaluation objects are first obtained from the title using rules and part-of-speech tagging. In this title corpus, potential evaluation objects depend strongly on verbs, so the thesis builds a verb dictionary and, starting from the position of a dictionary verb in the syntactic parse tree, traverses the whole tree to find the true evaluation object among the potential ones.
(3) The corpus comes from a city's "政府直通车" (government direct-line) website, which handles problems faced by local residents, so the text contains many locally specific named entities. To limit the effect of these terms on word segmentation and dependency parsing, the thesis adds a large number of local residential community names, road names, and bus and subway names to the segmenter as a user dictionary. Because segmentation is then fairly accurate, good results are obtained on both the text classification task and the evaluation object extraction task.
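The abstract does not give the formula by which the adjustment parameter combines the title and body topic distributions. As an illustration only, the sketch below assumes the simplest possible form, a linear interpolation with weight lam on the title. The function name combine_topic_distributions, the parameter lam, and the interpolation itself are assumptions made for this example and are not taken from the thesis.

```python
from typing import List, Sequence


def combine_topic_distributions(
    title_theta: Sequence[float],
    body_theta: Sequence[float],
    lam: float,
) -> List[float]:
    """Mix a title topic distribution with a body topic distribution.

    ASSUMPTION: linear interpolation with weight `lam` on the title.
    The thesis only says an adjustment parameter is used; this form
    is hypothetical.
    """
    if len(title_theta) != len(body_theta):
        raise ValueError("both distributions must cover the same topics")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    mixed = [lam * t + (1.0 - lam) * b for t, b in zip(title_theta, body_theta)]

    # Renormalize so the result is a probability distribution even if
    # the inputs were only approximately normalized.
    total = sum(mixed)
    if total <= 0.0:
        raise ValueError("distributions must have positive total mass")
    return [m / total for m in mixed]


if __name__ == "__main__":
    # Hypothetical 3-topic example: the title is focused, the body is diffuse.
    title = [0.9, 0.05, 0.05]
    body = [0.4, 0.3, 0.3]
    print(combine_topic_distributions(title, body, lam=0.5))
```

Under this assumed form, a larger lam lets the focused title dominate a noisy body, and lam = 0 falls back to the body distribution alone. The mixed vector would then serve as the document's feature vector for a downstream classifier.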
[Degree-granting institution]: Anhui University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1
Article ID: 1695848
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1695848.html