Research and Implementation of an Industry-Oriented Information Fusion Prototype System

Published: 2018-09-05 07:08

[Abstract]: With the rapid growth of the information industry, the volume of data on the web increases at an astonishing rate every day. User queries increasingly contain entity information, such as person names and organization names. Users try to build meaningful query conditions around entities and to find information related to those entities at the semantic level, rather than searching by keywords alone. General-purpose search engines that index at the document level, such as Google, Baidu, and Yahoo, retrieve documents by keyword matching. To some extent they no longer meet the search needs of Internet users, who expect entity-centric search systems to emerge.
This thesis surveys the shortcomings of these search engines and users' search habits, and proposes an information fusion method based on an entity association model. Using machine learning, it builds an industry-oriented web information fusion prototype system that fuses information around entities. The aim is to use the concept of an entity to integrate information, making entity-centric search easier and more effective for ordinary Internet users.
The main research work is as follows. First, an entity dictionary for the IT industry is built from Baidu Baike (Baidu Encyclopedia) by extracting, classifying, and organizing entries. Second, IT news texts from major portal sites and well-known IT industry blogs are collected and, using web page extraction techniques, organized into an industry-oriented corpus in the Chinese news domain. Third, an industry-oriented web information fusion prototype system is built using machine learning: a graph-based ranking algorithm computes the relevance between a text and an entity, the weight of each entity in a text is obtained on the basis of semantic understanding, and the degree of association between entities is computed from the weights of the entities in the texts where they appear. Finally, building on this work, a prototype entity-centric search system is completed.
In the experiments, the Chinese news-domain corpus constructed above serves as the test set for the prototype system. Compared with manually annotated entity association degrees, most of the text-entity relevance values and entity-entity association degrees in the entity model deviate from the manual annotations by less than 0.1. The computed results are largely consistent with human judgment and show high accuracy.
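The abstract names a graph-based ranking algorithm for entity weights and an entity association degree derived from those weights, but gives no formulas. The sketch below is only an illustration of how such a pipeline could look, not the thesis's actual method. It assumes a TextRank-style power iteration over an entity co-occurrence graph and uses cosine similarity across texts as a stand-in association measure. The function names, the `window` and `damping` parameters, and the sample data are all hypothetical.

```python
from collections import defaultdict
import math


def entity_weights(entity_mentions, window=3, damping=0.85, iterations=50):
    """TextRank-style weights for the entities mentioned in one text.

    entity_mentions: entity names in order of appearance (hypothetical
    input, e.g. the output of a dictionary-based entity matcher).
    """
    # Undirected co-occurrence graph: each mention is linked to the
    # distinct entities among the next (window - 1) mentions.
    graph = defaultdict(lambda: defaultdict(float))
    for i, a in enumerate(entity_mentions):
        for b in entity_mentions[i + 1 : i + window]:
            if a != b:
                graph[a][b] += 1.0
                graph[b][a] += 1.0

    nodes = sorted(set(entity_mentions))
    if not nodes:
        return {}

    # Power iteration of a weighted PageRank-like update.
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            rank = 0.0
            for m, w in graph[n].items():
                rank += score[m] * w / sum(graph[m].values())
            new_score[n] = (1 - damping) / len(nodes) + damping * rank
        score = new_score

    # Normalize so the weights within one text sum to 1.
    total = sum(score.values())
    return {n: s / total for n, s in score.items()}


def entity_association(doc_weights, a, b):
    """Assumed association measure: cosine similarity of the two
    entities' weight vectors across all texts."""
    va = [w.get(a, 0.0) for w in doc_weights]
    vb = [w.get(b, 0.0) for w in doc_weights]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


# Made-up sample data: each list is the entity mention sequence of one text.
docs = [
    ["Google", "Android", "Google", "Samsung"],
    ["Apple", "iPhone", "Apple"],
    ["Google", "Android", "Apple"],
]
weights = [entity_weights(d) for d in docs]
print(entity_association(weights, "Google", "Android"))
print(entity_association(weights, "Google", "iPhone"))
```

Under this assumed measure, entities that never appear in the same text get an association of zero, and entities that carry similar weights in the same texts score close to 1. The thesis may define both quantities differently.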
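[Degree-granting institution]: Beijing University of Posts and Telecommunications (北京邮电大学)
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3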

[References]