Research and Implementation of an Industry-Oriented Information Fusion Prototype System
[Abstract]: With the rapid development of the information industry, the volume of data on the Web grows at an alarming rate every day. A growing share of user queries contain entity information, such as person names and organization names. Users try to build meaningful query conditions around these entities and to discover the semantic relationships among them. General-purpose search engines such as Google, Baidu, and Yahoo rely on document-level indexing and keyword matching. To a certain extent they no longer meet the search needs of Internet users, who expect entity-centric search systems to emerge.
This thesis examines the shortcomings of these search engines and users' search habits, and proposes an information fusion method based on an entity association model. Using machine learning, it builds an industry-oriented Web information fusion prototype system that integrates information around entities. The aim is to use the concept of an entity to aggregate information, so that ordinary Internet users can carry out entity-centric search more conveniently and effectively.
The main research work of this thesis is as follows. First, an entity dictionary for the IT industry domain is built from Baidu Encyclopedia by extracting, classifying, and organizing its entries. Second, IT news articles and well-known IT industry blogs are collected from major portals, and Web page extraction techniques are used to organize them into an industry-oriented Chinese news corpus. Third, an industry-oriented Web information fusion prototype system is constructed by means of machine learning: the relevance between a text and an entity is calculated with a graph-based ranking algorithm, and the weight of each entity in a text is obtained on the basis of semantic understanding. Finally, building on this work, a prototype of an entity-centric search system is completed.
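The graph-based ranking step can be illustrated with a minimal sketch. The abstract does not name the specific algorithm, so the code below assumes a TextRank-style PageRank iteration over an entity co-occurrence graph; the function name `entity_weights`, the sentence-level co-occurrence rule, the plain substring matching against the dictionary, and the damping factor of 0.85 are all illustrative assumptions, not the thesis's actual method.

```python
from collections import defaultdict
from itertools import combinations


def entity_weights(sentences, dictionary, damping=0.85, iterations=50):
    """Score dictionary entities in one text with a PageRank-style iteration.

    Assumption: two entities are linked when they appear in the same
    sentence, and the edge weight counts how often that happens.
    """
    edges = defaultdict(lambda: defaultdict(float))
    nodes = set()
    for sentence in sentences:
        # Hypothetical matching rule: plain substring lookup in the dictionary.
        found = sorted(e for e in dictionary if e in sentence)
        nodes.update(found)
        for a, b in combinations(found, 2):
            edges[a][b] += 1.0
            edges[b][a] += 1.0
    if not nodes:
        return {}

    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            rank = 0.0
            for neighbor, weight in edges[node].items():
                # Each neighbor shares its score in proportion to edge weight.
                rank += scores[neighbor] * weight / sum(edges[neighbor].values())
            updated[node] = (1 - damping) / n + damping * rank
        scores = updated

    # Normalize so the weights of all entities in the text sum to 1.
    total = sum(scores.values())
    return {node: score / total for node, score in scores.items()}


if __name__ == "__main__":
    text = [
        "Lenovo and Intel announced a laptop",
        "Intel and AMD compete",
        "Intel ships chips to Lenovo",
    ]
    print(entity_weights(text, {"Lenovo", "Intel", "AMD", "Huawei"}))
```

Chinese text would normally need word segmentation before dictionary lookup; substring matching is used here only to keep the sketch self-contained.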
The industry-oriented information fusion prototype system is tested on a corpus from the Chinese news domain. The experimental results show that the text-entity relevance produced by the entity model constructed in this thesis compares well with manual annotation, and that the deviation between the computed entity-entity relevance and the manually labeled result is mostly less than 0.1. The calculated results are basically consistent with human judgment and have high accuracy.
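The comparison against manual labels can be sketched as follows. The abstract does not say how the deviation is defined, so this example assumes it is the absolute difference between the computed and manually labeled relevance of each entity pair; the function name `deviation_report` and all numeric values are made up for illustration and are not results from the thesis.

```python
def deviation_report(computed, manual, threshold=0.1):
    """Return per-pair absolute deviations and the share below the threshold."""
    deviations = {
        pair: abs(computed[pair] - manual[pair])
        for pair in manual
        if pair in computed
    }
    if not deviations:
        return deviations, 0.0
    within = sum(1 for d in deviations.values() if d < threshold)
    return deviations, within / len(deviations)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    computed = {("A", "B"): 0.62, ("A", "C"): 0.30, ("B", "C"): 0.80}
    manual = {("A", "B"): 0.60, ("A", "C"): 0.35, ("B", "C"): 0.50}
    deviations, share = deviation_report(computed, manual)
    print(deviations, share)
```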
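[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3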