XML Keyword Query Based on Result-Type Grouping

Published: 2018-04-17 10:12

Topics: entropy weight method + result type; source: master's thesis, Guangxi Normal University, 2011

[Abstract]: With the rapid growth of Internet applications, the Web has become a vast information space. Faced with such a large and heterogeneous body of information, people cannot locate valuable content unaided and must rely on external tools, which is why Web search engines emerged. Search engines have played a very important role in helping people obtain information from the Internet, but as the volume of information surges and its variety grows, existing Web search engines can no longer meet the growing demand. The keyword-based Web search engines popular today return whole HTML pages as query results; alongside the information the user wants, these pages carry much that is of no value to the user, such as advertisements. Information retrieval over XML data, by contrast, returns only data fragments related to the user's query target, possibly thousands of them. Work on structured querying of XML data has already made some progress, XQuery being one example. Compared with structured query languages, the main advantage of XML keyword search is that users need neither learn a complex query language nor understand the underlying data structure of the XML document in depth; they only enter keywords related to the content they are interested in. Keyword-based XML information retrieval has therefore become one of the active research topics in XML data retrieval.
This thesis takes the view that the logical structure of a complete information retrieval model has two parts: obtaining the query results, and ranking those results by similarity. Both parts rest on some common foundations. First, because of the distinctive structure of XML documents, every XML node must be encoded, and the encoding must both identify each node uniquely and express the structural relationships between nodes. This thesis therefore uses Dewey encoding, which represents the XML document and also supports some simple operations between nodes. Second, implementing the search engine requires node information together with its corresponding data, that is, an inverted index, so a suitable container is needed to store it. Because an embedded database lets the inverted index link seamlessly with the application process, this thesis uses the embedded database Berkeley DB. The inverted index then runs in the same address space as the application, which removes the overhead associated with a client-server configuration; the application does not have to establish a network connection to a database service beforehand, and instead saves, queries, modifies, and deletes data through the Berkeley DB library embedded in the program. As a result, the time needed to fetch the inverted index can be ignored in the experiments, which reduces the negative influence of the inverted index on the main subject of the experiments.
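The node encoding and the inverted index described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function names, the sample document, and the choice to index only each element's own text are assumptions, and a plain in-memory dictionary stands in for Berkeley DB.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def assign_dewey(root):
    """Return (dewey_label, element) pairs in document order.

    A Dewey label is a tuple of child positions on the path from the root:
    (1,) is the root and (1, 2, 1) is the first child of its second child.
    """
    labelled = []

    def visit(elem, label):
        labelled.append((label, elem))
        for position, child in enumerate(elem, start=1):
            visit(child, label + (position,))

    visit(root, (1,))
    return labelled


def is_ancestor(a, b):
    """True if the node labelled a is a proper ancestor of the node labelled b."""
    return len(a) < len(b) and b[:len(a)] == a


def lowest_common_ancestor(a, b):
    """The LCA of two nodes is the longest common prefix of their labels."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def build_inverted_index(labelled):
    """Map each lower-cased keyword to the labels of elements whose own text contains it."""
    index = defaultdict(list)
    for label, elem in labelled:
        for word in set((elem.text or "").lower().split()):
            index[word].append(label)  # labels arrive in document order
    return index


# Hypothetical sample document, used only for illustration.
DOC = """
<bib>
  <book><title>XML Search</title><author>Li Wei</author></book>
  <book><title>Data Mining</title><author>Li Na</author></book>
</bib>
"""

labelled = assign_dewey(ET.fromstring(DOC.strip()))
index = build_inverted_index(labelled)
```

With this sample, `index["li"]` holds `(1, 1, 2)` and `(1, 2, 2)`, and `lowest_common_ancestor((1, 1, 1), (1, 1, 2))` gives `(1, 1)`, the first `book` element. Structural questions reduce to prefix comparisons on the labels, which is the kind of simple node operation the encoding makes possible.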
On obtaining query results, this thesis introduces several important semantics for XML keyword queries and their corresponding query processing algorithms. By comparing and analyzing the query results produced under these semantics and summarizing their problems, it then proposes three rules that a high-quality query result must satisfy. Based on these three rules, we propose a new approach. First, at the macro level, the entropy weight method is used to determine the type of the query results, so that the results match the user's basic intent. Then, at the micro level, the results are grouped so that each logical group forms a query result containing complete information. The thesis also designs a set of experiments that analyze the query results in two respects: query quality, and query efficiency and stability. The experimental data indicate that the three rules and the entropy weight method are highly feasible for determining query results.
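The abstract does not say which indicators feed the entropy weight method, so the following is only a sketch of the standard procedure under assumed inputs: rows are candidate result types, columns are hypothetical non-negative indicators, and the candidate names and numbers are invented for illustration. An indicator whose values differ more across candidates has lower entropy and receives a larger weight.

```python
import math


def entropy_weights(matrix):
    """Entropy weights for the columns (indicators) of a non-negative matrix.

    matrix[i][j] is the value of indicator j for candidate i.
    """
    n = len(matrix)
    m = len(matrix[0])
    if n < 2:
        raise ValueError("need at least two candidates")
    k = 1.0 / math.log(n)  # scales each entropy into [0, 1]
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total == 0:
            divergences.append(0.0)  # an all-zero indicator carries no information
            continue
        entropy = 0.0
        for value in column:
            p = value / total
            if p > 0:
                entropy -= p * math.log(p)
        # Clamp to guard against tiny negative values from rounding.
        divergences.append(max(0.0, 1.0 - k * entropy))
    spread = sum(divergences)
    if spread == 0:
        return [1.0 / m] * m  # no indicator discriminates: fall back to equal weights
    return [d / spread for d in divergences]


def weighted_scores(matrix, weights):
    """Score each candidate by the weighted sum of its column-normalized values."""
    totals = [sum(row[j] for row in matrix) or 1.0 for j in range(len(weights))]
    return [
        sum(w * row[j] / totals[j] for j, w in enumerate(weights))
        for row in matrix
    ]


# Hypothetical candidates and indicator values, for illustration only.
candidates = ["book", "author", "title"]
matrix = [
    [2.0, 9.0],
    [2.0, 1.0],
    [2.0, 0.0],
]

weights = entropy_weights(matrix)
scores = weighted_scores(matrix, weights)
best_type = candidates[scores.index(max(scores))]
```

In this example the first indicator is identical for every candidate, so its weight is effectively zero and the choice of result type is driven by the second indicator alone.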
On relevance ranking, this thesis introduces the traditional similarity measures for flat documents, which are the foundation for studying XML similarity measures, as well as a recently proposed XML similarity measure that takes the structure of XML documents into account but, because the algorithm is recursive, has certain shortcomings in efficiency and stability. In view of this, the thesis designs a logical-group-based XML similarity measure that builds on the flat-document similarity measures and the concept of logical groups. The method both takes the structural characteristics of XML documents into account and keeps the scale of computation within a controllable range. To demonstrate the effectiveness of the algorithm, comparative experiments were carried out in two respects: ranking quality, and ranking efficiency and stability. The experimental data show that the method achieves a clear improvement in efficiency and stability.
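The abstract does not name the traditional flat-document measure it builds on. A common choice is cosine similarity over TF-IDF vectors, and the sketch below assumes that choice. It treats each logical group as a flat bag of tokens, which is a simplification for illustration and not the thesis's logical-group measure; the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def idf_table(documents):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(documents)
    document_frequency = Counter()
    for tokens in documents:
        document_frequency.update(set(tokens))
    return {term: math.log(n / df) + 1.0 for term, df in document_frequency.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector (term -> weight); terms unknown to the collection are dropped."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, groups):
    """Return (score, group_index) pairs, best match first."""
    idf = idf_table(groups)
    query_vector = tfidf(query, idf)
    scored = [(cosine(query_vector, tfidf(g, idf)), i) for i, g in enumerate(groups)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical logical groups, each flattened to its tokens.
groups = [
    ["xml", "search", "li", "wei"],
    ["data", "mining", "li", "na"],
    ["xml", "query", "language"],
]
ranking = rank(["xml", "li"], groups)
order = [i for _, i in ranking]
```

For the query `["xml", "li"]`, the first group ranks highest because it contains both keywords. Each score depends only on the tokens of one group, so the cost per result is bounded by the group's size, in contrast with a measure that recurses over a whole subtree.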

[Degree-granting institution]: Guangxi Normal University
[Degree level]: Master's
[Year degree conferred]: 2011
[Classification number]: TP391.3

[Similar literature]