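Research on Named Entity Query Processing Techniques in Search Engines
Published: 2018-06-12 03:53
Topics: named entity + query segmentation; Source: Harbin Institute of Technology, doctoral dissertation, 2012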
[Abstract]: The Internet has become an important platform for obtaining information and conducting transactions. With the rapid growth of data and application resources online, search engines have become a necessary tool for retrieving information quickly and accurately from this mass of resources. Users express their information needs by submitting queries, and the search engine analyzes each query to return the results the user needs. The query is therefore the essential channel of communication between user and search engine. For a search engine to understand accurately the information need a query expresses, research on automatic query analysis and processing is required.
Named entity queries are an important class of queries. They account for a large share of search engine queries and have characteristics of their own. Studying techniques for processing them helps the search engine analyze the user's search intent, return accurate results, and improve the search experience. Named entity query processing typically involves extracting the semantic segments of a query, recognizing the entities it contains, and analyzing its search intent. Accordingly, this dissertation studies named entity query processing from the following aspects.
1. Unsupervised automatic query segmentation based on a monolingual word alignment model. Query segmentation is a basic and necessary query processing step: it divides a query, given as a character sequence, into semantic units such as words or phrases. Because the vocabulary that appears in queries is huge and contains many nonstandard words, supervised methods require large amounts of manually annotated training data and therefore adapt poorly to the task. This dissertation proposes an unsupervised query segmentation method based on a monolingual word alignment model. The method trains the segmentation model automatically from query logs alone and combines character co-occurrence, position, and fertility information in the model, yielding good segmentation results. Building on this term-level segmentation, the dissertation further segments queries hierarchically, representing each query as a tree of segments. The hierarchy shows which segments of a query are more closely related. Experiments show that the method segments queries better than existing segmentation methods.
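To make the idea of hierarchical segmentation concrete, here is a minimal sketch in Python. It is not the dissertation's word alignment model: as a stand-in it scores each boundary with a crudely smoothed pointwise mutual information (PMI) estimated from a toy query log, then splits recursively at the weakest boundary. The function names, the smoothing, and the sample log are all assumptions made for illustration. For Chinese queries the tokens would be individual characters.

```python
import math
from collections import Counter

def train_counts(queries):
    """Count tokens and adjacent token pairs over a tokenised query log."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in queries:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def pmi(a, b, unigrams, bigrams):
    """Add-one smoothed PMI of the adjacent pair (a, b); a rough association score."""
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    p_ab = (bigrams[(a, b)] + 1) / (total_bi + 1)
    p_a = (unigrams[a] + 1) / (total_uni + 1)
    p_b = (unigrams[b] + 1) / (total_uni + 1)
    return math.log(p_ab / (p_a * p_b))

def segment_tree(tokens, unigrams, bigrams):
    """Split recursively at the weakest boundary; returns a nested tuple tree."""
    if len(tokens) == 1:
        return tokens[0]
    scores = [pmi(a, b, unigrams, bigrams) for a, b in zip(tokens, tokens[1:])]
    cut = scores.index(min(scores)) + 1  # boundary with the lowest association
    return (segment_tree(tokens[:cut], unigrams, bigrams),
            segment_tree(tokens[cut:], unigrams, bigrams))

# Hypothetical toy query log, already tokenised.
log = [["new", "york", "hotel"], ["new", "york", "weather"],
       ["new", "york", "times"], ["cheap", "hotel"], ["hotel", "booking"]]
uni, bi = train_counts(log)
print(segment_tree(["new", "york", "hotel"], uni, bi))
# (('new', 'york'), 'hotel')
```

The nested tuple plays the role of the segment tree: tokens grouped deeper in the tree are the ones the log statistics bind most tightly.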
2. Named entity mining from query logs based on a random walk model on a graph. A query log is a data resource containing a large number of named entities. Named entities mined from query logs better match the way users name entities when they formulate queries, and because query logs are updated continuously, they record newly emerging entity names. This makes named entity mining from query logs of practical value for search engines that handle named entity queries. The dissertation adopts a weakly supervised method: a small number of named entity names from the target category serve as seeds; candidate named entities extracted from the query log, the context templates of named entities in queries, and user-clicked URLs are used to build a tripartite graph; and a random walk algorithm on that graph retrieves the named entities of the target category. Experiments show that the method effectively combines the entity-related information in query logs and improves the accuracy of the named entities obtained from them.
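The seed-driven random walk can be illustrated with a small sketch. The abstract does not give the exact graph construction or walk variant, so the code below assumes an undirected, unweighted graph in which each candidate entity is linked to the templates and clicked URLs it occurs with, and uses a random walk with restart at the seeds. The node names, the `e:`/`t:`/`u:` prefixes, and the parameter values are hypothetical.

```python
from collections import defaultdict

def random_walk_with_restart(edges, seeds, restart=0.15, iterations=50):
    """Score nodes by a random walk that keeps restarting at the seed nodes.

    Assumes every seed appears in at least one edge.
    """
    neighbours = defaultdict(set)
    for u, v in edges:  # undirected, unweighted edges
        neighbours[u].add(v)
        neighbours[v].add(u)
    nodes = list(neighbours)
    restart_dist = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart_dist)
    for _ in range(iterations):
        nxt = {n: restart * restart_dist[n] for n in nodes}
        for u in nodes:
            # Spread the remaining probability mass evenly over the neighbours.
            share = (1.0 - restart) * score[u] / len(neighbours[u])
            for v in neighbours[u]:
                nxt[v] += share
        score = nxt
    return score

# Hypothetical graph: entities (e:), context templates (t:), clicked URLs (u:).
edges = [
    ("e:harry potter", "t:# movie download"), ("e:harry potter", "u:imdb.com"),
    ("e:inception", "t:# movie download"), ("e:inception", "u:imdb.com"),
    ("e:paris", "t:# weather"), ("e:paris", "u:weather.com"),
    ("e:paris", "u:imdb.com"),  # a noisy click
    ("e:london", "t:# weather"), ("e:london", "u:weather.com"),
]
scores = random_walk_with_restart(edges, seeds={"e:harry potter"})
candidates = sorted((n for n in scores if n.startswith("e:")),
                    key=scores.get, reverse=True)
print(candidates)
```

With a single film seed, candidates that share templates and URLs with the seed rank above those that reach it only through a noisy edge. A real system would threshold or take the top-ranked candidates per category.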
3. Acquiring synonymous attribute phrases of named entities from online encyclopedias. Among the attribute phrases of a named entity, phrases that express the same attribute in different forms are called synonymous attribute phrases. Acquiring them helps in analyzing the search intent of named entity queries, because users typically build such queries with attribute phrases to express a need for the value of an entity attribute. The dissertation extracts attribute phrases of named entities from online encyclopedias and uses a classification framework that combines multiple features to identify the synonymous ones. To our knowledge, this is the first work to propose using online encyclopedias to acquire synonymous attribute phrases. Experiments show that online encyclopedias are an effective resource for acquiring synonymous attribute phrases of entities and that the proposed method acquires a large number of them effectively.
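The abstract does not list the features fed to the classifier, so the following Python sketch only illustrates what a feature vector for a candidate phrase pair might look like. The three features (character overlap, overlap of observed entity and value pairs, and length difference), the input format, and the sample data are all assumptions, not the dissertation's feature set. A trained classifier would consume these vectors.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pair_features(phrase_a, phrase_b, values):
    """Feature vector for a candidate pair of attribute phrases.

    `values` maps an attribute phrase to the set of (entity, value) pairs
    observed for it in encyclopedia pages (a hypothetical input format).
    """
    return {
        "char_overlap": jaccard(phrase_a, phrase_b),
        "value_overlap": jaccard(values.get(phrase_a, ()),
                                 values.get(phrase_b, ())),
        "length_diff": abs(len(phrase_a) - len(phrase_b)),
    }

# Hypothetical observations gathered from encyclopedia infoboxes.
values = {
    "birthplace": {("Lu Xun", "Shaoxing"), ("Li Bai", "Suyab")},
    "place of birth": {("Lu Xun", "Shaoxing")},
    "occupation": {("Lu Xun", "writer")},
}
print(pair_features("birthplace", "place of birth", values))
print(pair_features("birthplace", "occupation", values))
```

Two phrases that tend to carry the same value for the same entity get a high `value_overlap`, which is the kind of evidence a classifier can combine with surface similarity.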
4. Search intent recognition for named entity queries. This part comprises classification-based recognition of query intent and finer-grained intent recognition based on query patterns. Query intent classification restricts the category space of the search results and improves retrieval accuracy. The classifier fuses information from multiple resources: effective classification features are derived from analysis of the query text, the query log, and web search results. Within the informational and transactional named entity queries identified by the intent classifier, the dissertation then extracts the query patterns that users frequently employ and clusters patterns with similar search intent. Query patterns can be matched against the queries users submit, helping the search engine analyze their intent accurately. The patterns are acquired by cascading a graph-model-based method and a similarity-based method. Experiments show that the approach acquires query patterns effectively across multiple entity categories.
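As a rough illustration of what a query pattern is, the Python sketch below replaces a known entity name in each query with a slot and counts the resulting patterns. It covers only this counting step. The graph-based and similarity-based acquisition and the clustering described above are not reproduced, and the slot marker, the naive substring matching, and the sample queries are assumptions.

```python
from collections import Counter

def extract_patterns(queries, entities, slot="<ENTITY>"):
    """Replace one known entity name per query with a slot and count patterns."""
    patterns = Counter()
    # Try longer names first so a longer entity name wins over its prefix.
    ordered = sorted(entities, key=len, reverse=True)
    for query in queries:
        for name in ordered:
            # Naive substring match; skip queries that are only the entity name.
            if name in query and query != name:
                patterns[query.replace(name, slot, 1).strip()] += 1
                break
    return patterns

# Hypothetical queries and entity list for one category (films).
queries = ["harry potter download", "inception download",
           "inception cast", "weather today", "inception"]
entities = {"harry potter", "inception"}
print(extract_patterns(queries, entities))
# Counter({'<ENTITY> download': 2, '<ENTITY> cast': 1})
```

Frequent patterns such as `<ENTITY> download` hint at a transactional intent, while `<ENTITY> cast` hints at an informational one. Matching an incoming query against such patterns is one way to use them at search time.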
In summary, this dissertation studies several key techniques for named entity query processing. Some of them, designed with broader applicability in mind, target not only named entity queries but can also be applied to other queries. The research has produced some preliminary conclusions and results, which we hope will benefit named entity query processing in search engines.
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctorate
[Year degree granted]: 2012
[Classification number]: TP391.3
Article ID: 2008216
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2008216.html