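Research on Event Web Page Information Retrieval Driven by Spatio-temporal Elements
Published: 2018-01-21 23:33
Keywords: web page text; event; spatio-temporal elements; retrieval; "time-space-topic" index. Source: Nanjing Normal University, 2013 master's thesis. Document type: degree thesis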
[Abstract]: Supported by the national "863" project "Research on Ubiquitous Spatial Information Association and Updating and Topic-Oriented Spatio-temporal Information Mining", this thesis explores methods for acquiring and retrieving event-oriented web page text, in order to provide data support for the structured representation of multi-source web information, the reconstruction of spatio-temporal event sequences, visualization, and mining analysis. Following the technical pipeline of "data acquisition, organization and management, retrieval service" for event web page text, the thesis analyzes how event information is described linguistically and organized in Chinese web page text and, taking natural disaster events as an example, studies the key technologies of an event web page retrieval engine driven by spatio-temporal elements. The main contents and conclusions are as follows. (1) Event web page acquisition driven by spatio-temporal elements: by analyzing the content and features of web page text that describes events, an event expression template is constructed with time, spatial location, and event topic as its basic elements; a web crawler is then customized according to this template to collect web page text describing events. Experiments show that, compared with a traditional crawler, the event topic crawler built on the event expression template filters web pages well and retrieves pages with higher precision, but the large amount of computation introduced into the topic crawler reduces its relative performance. (2) Distributed "time-space-topic" indexing and storage of event web pages: rule-based models and a conditional random field (CRF) model are used to extract event-related time, spatial location, and topic information from web page text, and a web page text event classification method based on a support vector machine (SVM) model is proposed; a distributed index based on "time-space-topic" is built to address low retrieval efficiency; and distributed storage of massive web page text is implemented on the HBase database and the HDFS file system. (3) "Text-map" interactive event web page retrieval service: by summarizing how users phrase retrieval queries, automatic parsing of event retrieval queries is implemented; drawing on the lexical organization of the Tongyici Cilin synonym thesaurus, a domain vocabulary knowledge base for natural disaster events and a similarity retrieval model are constructed, enabling similarity computation and ranking between candidate web page text and the retrieval conditions. (4) Prototype system design and implementation: based on the proposed event web page acquisition, distributed indexing and storage, and retrieval service methods, and using the Google Map API, a corresponding prototype system is designed; its architecture and main functional modules are discussed.
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.1;P208