
Research on Heterogeneous Data Indexing Methods for Dataspaces

Published: 2018-07-11 12:16

  Topics: dataspace + index; Source: Harbin Engineering University, master's thesis, 2013


[Abstract]: The information held by individuals and organizations is growing rapidly, and the share of unstructured data keeps rising. The massive, distributed, heterogeneous, and coexisting data belonging to a given subject constitutes a dataspace. Providing users with efficient, convenient, and diverse search and query services is a major challenge for dataspaces, and an efficient index over the heterogeneous data in a dataspace is the foundation for meeting it. Research on indexing heterogeneous data in dataspaces is therefore of considerable significance.

The data management research community has studied indexing extensively. Past work typically assumed a single data format and a single query style, for example unstructured data with keyword queries in search engines, or relational tables with SQL queries in relational databases. Data in a dataspace, however, comes from multiple sources and is heterogeneous: it may include structured, semi-structured, and unstructured formats. In addition, the pay-as-you-go nature of dataspaces calls for a range of search and query services, from keyword queries to structured queries. At first, when little information has been extracted and no semantic associations exist between data sources, the system may offer only basic keyword search; over time, as users and the system gradually build up more schema and semantic association information, the system can support richer query styles. Unlike traditional indexing methods, therefore, an indexing method for dataspaces must index data in multiple formats while supporting multiple query styles, including keyword queries and structured queries.

Based on an analysis of existing data models and queries, this thesis adopts the iMeMex data model as the dataspace data model and defines three query types: keyword queries, predicate queries, and path queries. On this basis it proposes a new indexing method, called the EIBH hybrid index, to improve the efficiency of search and query over heterogeneous data in a dataspace. The new index consists of an extended inverted list and two auxiliary indexes. It extends the keyword column and the linked-list node information of the inverted list to index resource views, which supports the three query types and improves query processing efficiency; the two auxiliary indexes address the low efficiency of index joins. Experimental results show that the method is an effective and feasible solution to the indexing and query efficiency problems for heterogeneous data in dataspaces.
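The abstract does not spell out the internal layout of the EIBH index, so the following is only a minimal Python sketch of the general idea it describes: one inverted list whose keys cover more than plain keywords, plus two auxiliary maps that let path queries avoid joining posting lists. Everything here is an assumption made for illustration, not the thesis's design: the `ResourceView` fields, the `HybridIndex` class, the tagged key tuples, and the choice of a parent map and a name map as the two auxiliary structures.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResourceView:
    """Toy stand-in for an iMeMex-style resource view (fields are assumptions)."""
    rid: int
    name: str
    attrs: dict = field(default_factory=dict)
    content: str = ""
    parent: Optional[int] = None


class HybridIndex:
    """Hypothetical hybrid index: extended inverted list + two auxiliary maps."""

    def __init__(self):
        # Extended inverted list: keys are tagged tuples, values are sets of rids.
        self.postings = defaultdict(set)
        # Auxiliary map 1 (assumed): rid -> parent rid, used to walk paths.
        self.parent_of = {}
        # Auxiliary map 2 (assumed): rid -> name, used to check each path step.
        self.names = {}

    def add(self, view):
        self.names[view.rid] = view.name
        self.parent_of[view.rid] = view.parent
        self.postings[("name", view.name)].add(view.rid)
        for word in (view.name + " " + view.content).lower().split():
            self.postings[("kw", word)].add(view.rid)
        for attr, value in view.attrs.items():
            self.postings[("attr", attr, str(value).lower())].add(view.rid)

    def keyword(self, *words):
        """Views containing every given word in their name or content."""
        sets = [self.postings.get(("kw", w.lower()), set()) for w in words]
        return set.intersection(*sets) if sets else set()

    def predicate(self, attr, value):
        """Views whose attribute `attr` equals `value`."""
        return set(self.postings.get(("attr", attr, str(value).lower()), set()))

    def path(self, *names):
        """Views reached by the parent/child chain names[0]/.../names[-1]."""
        if not names:
            return set()
        matches = set()
        for rid in self.postings.get(("name", names[-1]), set()):
            current, ok = rid, True
            # Walk upward through the auxiliary maps instead of joining postings.
            for expected in reversed(names):
                if current is None or self.names[current] != expected:
                    ok = False
                    break
                current = self.parent_of[current]
            if ok:
                matches.add(rid)
        return matches


idx = HybridIndex()
idx.add(ResourceView(1, "projects"))
idx.add(ResourceView(2, "dataspace", {"type": "folder"}, parent=1))
idx.add(ResourceView(3, "thesis.tex", {"type": "latex"},
                     "hybrid index for heterogeneous data", parent=2))
idx.add(ResourceView(4, "notes.txt", {"type": "text"}, "index ideas", parent=1))

print(sorted(idx.keyword("hybrid", "index")))
print(sorted(idx.predicate("type", "latex")))
print(sorted(idx.path("projects", "dataspace", "thesis.tex")))
```

In this sketch all three query types are answered from the same posting structure, and the path query resolves each step with dictionary lookups rather than intersecting posting lists. That mirrors, in a very simplified way, the abstract's motivation for adding auxiliary indexes to avoid slow index joins.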
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3



