Keyword-Based Database Selection for the Deep Web

Published: 2018-07-29 09:20

[Abstract]: This paper proposes a keyword-based query method for the deep Web: users submit queries as keywords, and the method selects, online, the Web databases that reflect the query intent and provide high-quality results. The approach avoids crawling deep Web data, an operation that is costly and difficult, and it supports keyword queries over databases from multiple domains, so it can be integrated seamlessly with existing search engines. The paper focuses on keyword-based database selection and addresses the challenges of this problem from two aspects: (1) it proposes a relevance model that measures the association between keywords and domain attributes, and designs a random-walk-based algorithm that discovers latent keyword-attribute associations from query logs; (2) it presents a new data sampling method and uses it in a sampling-based database-query relevance model, which ultimately solves the database selection problem for the deep Web. Experiments on a real dataset from the Chinese deep Web show that the proposed method effectively selects databases relevant to keyword queries and provides high-quality results.
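[Affiliation]: Department of Computer Science and Technology, Tsinghua University
[Funding]: Supported by the Key Program of the National Natural Science Foundation of China, "Infrastructure Construction for Supporting Chinese Web Research and Basic Methods and Key Technologies in Its Applications" (60833003)
[Classification No.]: TP311.13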
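
The abstract mentions a random-walk-based algorithm that discovers latent keyword-attribute associations from query logs, but it does not give the algorithm itself. The following is only a minimal sketch of one common variant, random walk with restart, and is not taken from the paper. The graph layout, the `kw:` and `attr:` node names, the edge weights, and the `restart` value are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def random_walk_with_restart(edges, start, restart=0.15, iterations=50):
    """Score every node by its proximity to `start` on an undirected weighted graph.

    Illustrative only: `edges` is a hypothetical list of (node, node, weight)
    triples with positive weights, for example derived from a query log.
    """
    # Build a symmetric adjacency map: node -> {neighbor: weight}.
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = graph[u].get(v, 0.0) + w
        graph[v][u] = graph[v].get(u, 0.0) + w
    if start not in graph:
        raise ValueError(f"unknown start node: {start}")

    # All probability mass begins on the start node.
    scores = {node: 0.0 for node in graph}
    scores[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            total = sum(graph[node].values())
            # Move (1 - restart) of the mass to neighbors, proportional to weight.
            for nbr, w in graph[node].items():
                nxt[nbr] += (1.0 - restart) * mass * w / total
        # Return the remaining mass to the start node.
        nxt[start] += restart
        scores = nxt
    return scores


# Hypothetical graph: keyword-attribute edges plus keyword co-occurrence edges.
edges = [
    ("kw:thinkpad", "attr:computer.brand", 3.0),
    ("kw:thinkpad", "kw:x200", 2.0),
    ("kw:x200", "kw:laptop", 1.0),
    ("kw:laptop", "attr:computer.type", 2.0),
    ("kw:beijing", "attr:job.location", 4.0),
]

scores = random_walk_with_restart(edges, "kw:x200")
# Keep only attribute nodes and rank them by score, highest first.
ranked_attributes = sorted(
    (node for node in scores if node.startswith("attr:")),
    key=lambda node: scores[node],
    reverse=True,
)
```

In this toy graph, `kw:x200` has no direct edge to any attribute, yet the walk assigns a positive score to `attr:computer.brand` through the co-occurring keyword `kw:thinkpad`, which is the sense in which such an association is latent. An attribute in a disconnected part of the graph, such as `attr:job.location`, receives a score of zero.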
