
Research on Related Entity Finding and Homepage Finding

Published: 2018-06-06 21:11

Topic: TREC + REF; Source: Beijing University of Posts and Telecommunications, master's thesis, 2013


[Abstract]: REF (Related Entity Finding) is a very promising research topic in entity retrieval at TREC (Text Retrieval Conference). Research on it could substantially change search engines and the way people process information on the web. The REF task requires that, given the information provided in a topic, a system extract from the Internet and related databases the related entities that answer the topic, together with the homepage of each entity. This thesis surveys the state of the art at home and abroad and a number of recent algorithms, and analyzes in turn keyword extraction and expansion, text retrieval, passage segmentation and relevance computation, named entity recognition, entity ranking, and supporting-document retrieval. The improvements and innovations in the implementation are as follows:
(1) Earlier approaches processed the full text of each web page. This work adds processing at the level of short text, i.e. passages, which removes a large amount of irrelevant text, reduces the size of the returned text, and improves the processing efficiency of the system.
(2) Based on the structural features of Wikipedia, the synonyms and hypernyms found in Wikipedia are used to build a Wikipedia-based category dictionary, which is applied in the entity extraction stage. This suits the many fine-grained entity types in this year's REF task and also improves the accuracy of entity extraction.
(3) A word-density-based algorithm is added to check the results of the DCM model, with fairly good results. In addition, the parameters in the formula of the DCM (document-centered model) are adjusted according to last year's answers, improving the model.
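The abstract does not give the passage segmentation or scoring formulas, so the following is only a minimal Python sketch of the idea in item (1): split a page into short passages, score each one, and keep only the passages that mention the topic keywords. The function names, the sentence-based splitting rule, and the density score (keyword tokens divided by all tokens) are illustrative assumptions, not the method used in the thesis.

```python
import re


def split_passages(text, max_sentences=3):
    """Split text into passages of at most max_sentences sentences.

    Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


def keyword_density(passage, keywords):
    """Return the fraction of tokens in the passage that are topic keywords.

    This is a stand-in relevance score, not the formula from the thesis.
    """
    tokens = re.findall(r"\w+", passage.lower())
    if not tokens:
        return 0.0
    wanted = {k.lower() for k in keywords}
    hits = sum(1 for t in tokens if t in wanted)
    return hits / len(tokens)


def top_passages(text, keywords, k=2, max_sentences=3):
    """Return up to k (score, passage) pairs, highest score first.

    Passages with no keyword at all are dropped, which is where the
    reduction in returned text comes from.
    """
    passages = split_passages(text, max_sentences)
    scored = [(keyword_density(p, keywords), p) for p in passages]
    scored = [sp for sp in scored if sp[0] > 0]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    page = (
        "Boeing builds aircraft. Airlines fly the Boeing 747. "
        "The weather is nice. I like tea."
    )
    for score, passage in top_passages(page, ["Boeing", "airlines"], max_sentences=1):
        print(f"{score:.2f}  {passage}")
```

In this sketch the two off-topic sentences are discarded before any later stage such as named entity recognition would see them. A real system would likely use a stronger relevance model than raw keyword density.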
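[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3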






