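Research on Scenic Spot Entity Resolution Based on Multi-Stage Mixed Attributes
Published: 2018-04-29 14:24
Topics: scenic spot entity resolution + multi-stage; Source: Jiangxi Normal University, 2015 master's thesis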
[Abstract]: Entity resolution is a long-established research direction that has again become a research hotspot in recent years, and domain-specific entity resolution is one of its active topics. Unlike general-purpose entity resolution, domain-specific entity resolution must comprehensively analyze and capture the characteristics of the domain data and make full use of them. General-purpose methods usually complete entity resolution by matching all feature data at once in a single stage. This lets different kinds of feature data interfere with one another, and it also makes it hard to exploit each kind of feature data in a targeted way, both of which reduce the accuracy of entity resolution. Therefore, in the context of tourism information, and building on a literature review of domain-independent and domain-specific entity resolution, this thesis proposes a scenic spot entity resolution method based on multi-stage mixed attributes. The method extracts feature information from the different attributes of scenic spots across different tourism data sources, and designs an algorithm for each of several stages so that the relevant feature information is used more than once, finally achieving scenic spot entity resolution. The attributes include the scenic spot name, the scenic spot location, and the scenic spot introduction. Entity resolution is divided into two stages. The first stage uses the noun information in the scenic spot introductions to cluster the scenic spots from different tourism websites. The second stage builds on the clustering results and uses the similarity of scenic spot names and of the person names and place names in the introductions, applying a bucket-based algorithm to perform entity resolution. The contributions of this thesis are as follows: (1) it addresses the problem of entity resolution for tourist scenic spots; (2) it proposes a complete entity resolution framework for scenic spots based on multi-stage mixed attributes, which uses the effective information of entity attributes in a targeted way at each stage; (3) it proposes a scenic spot similarity measure that combines scenic spot names with scenic spot introductions; (4) it proposes a k-means clustering optimization algorithm based on farthest initial center points and a silhouette coefficient evaluation function; (5) it adapts a bucket-based resolution algorithm; (6) it reports extensive comparative experiments on real tourist scenic spot datasets.
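The abstract names a k-means variant that starts from farthest initial center points and is evaluated with the silhouette coefficient, but it does not give the procedure. The following is a minimal sketch of one common reading of those two ideas, not the thesis's actual algorithm: initial centers are chosen greedily so that each new center is the point farthest from the centers already chosen, and the number of clusters is the candidate k with the highest mean silhouette score. All function names, the Euclidean distance, the choice of the first point as the first center, and the toy 2-D points are assumptions made for illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first_centers(points, k):
    """Assumed initialization: start from the first point, then repeatedly
    add the point whose distance to its nearest chosen center is largest."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Standard Lloyd iterations from farthest-first centers; returns labels."""
    centers = farthest_first_centers(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    clusters = set(labels)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def best_k(points, candidates):
    """Run k-means for each candidate k and keep the best silhouette score."""
    results = {k: kmeans(points, k) for k in candidates}
    k = max(candidates, key=lambda c: silhouette(points, results[c]))
    return k, results[k]


if __name__ == "__main__":
    # Hypothetical 2-D feature vectors standing in for scenic spot descriptions.
    demo = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
            (20, 0), (20, 1), (21, 0)]
    k, labels = best_k(demo, [2, 3, 4, 5])
    print(k, labels)
```

In the thesis setting the points would presumably be feature vectors built from the nouns in each scenic spot introduction; how those vectors are constructed is not stated in the abstract, so the sketch takes numeric tuples as given.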
[Degree-granting institution]: Jiangxi Normal University
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: F590.3
Article ID: 1820310
Article link: https://www.wllwen.com/jingjilunwen/lyjj/1820310.html