Research on Retrieval Techniques for Heterogeneous Information Networks

Published: 2018-05-07 08:29

Topic: heterogeneous information networks + information retrieval; Source: Hunan University, 2014 doctoral dissertation

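【Abstract】: In the real world, information objects influence and interact with the objects around them in different respects, at different levels, and in different ways, and together they form complex information networks. Information networks help us represent and store the essential information of the real world, and analyzing the link information they contain makes them a useful tool for uncovering information hidden in the real world. Mining information and acquiring knowledge from information networks has therefore become an active research topic. Building on an analysis of the state of research on information networks, and heterogeneous information networks in particular, this dissertation constructs heterogeneous information networks by analyzing the relationships between documents and their related objects, and studies key techniques in information retrieval: semi-supervised learning, document clustering, cluster label extraction for retrieval results, and query recommendation. The main research work and contributions are as follows.

(1) A framework is proposed for constructing a heterogeneous information network from the content features of queries and documents and the click relationships between them, and for semi-supervised learning on that network. Feature-based similarity graphs are built separately from the content features of queries and of documents, a query-document bipartite graph is built from the click relationships between queries and documents, and the discriminative information of labeled samples is introduced to strengthen the network structure. A regularization framework and a label propagation algorithm are proposed for semi-supervised learning on the query-document heterogeneous information network. Given only a small number of labels, the method makes fuller use of the content of the queries and documents themselves and lets labels propagate between the two through their relationships. Experiments show that the method outperforms traditional semi-supervised learning methods.

(2) A graph-regularized semi-supervised learning framework is established for high-order heterogeneous information networks that contain multiple types of objects and links. Within this framework, graph regularization distinguishes the semantics of the different link types. A cost function is proposed that fully preserves the smoothness of the spatial structure revealed jointly by labeled and unlabeled samples, and its closed-form solution is derived. A label propagation algorithm for high-order heterogeneous information networks is proposed, in which label information spreads from labeled nodes to neighboring nodes until a stable state is reached, and the algorithm is proved to converge to the closed-form solution of the cost function. Several classical semi-supervised learning algorithms are special cases of this framework.

(3) Two topic propagation models, TP-TS and TP-Unify, are proposed for text-rich query-document heterogeneous information networks. TP-TS treats topic modeling and the random walk as two independent processes: it first builds a topic model of the text content with probabilistic latent semantic analysis (PLSA), then propagates the topic information across the heterogeneous query-document bipartite graph to reveal the topics of the different nodes and assign them to categories. TP-Unify introduces consistency constraints between heterogeneous nodes of the network into topic analysis, so that topic modeling and network structure analysis are carried out together.

(4) A new method of cluster label extraction is proposed. Its basic idea is to recast label extraction as the problem of ranking the query terms related to a cluster, which avoids extracting topic words from the cluster of web documents. An algorithm is proposed that fuses the query-page click graph, the page similarity graph, and the link graph to rank query terms and web pages jointly. The algorithm effectively integrates the assessments of web pages made by users, page creators, and page authors.

(5) Query recommendation based on log analysis is combined with query recommendation based on semantic analysis: a Term-Query-URL heterogeneous information network is constructed to analyze log information and semantic information together, and recommendations are produced by a query-based random walk with restart. Collaborative recommendation from click logs works well on high-frequency queries, while a document-based method for learning the semantic relationships between terms and queries improves recommendations for sparse queries. Experiments on the query log of a large-scale commercial search engine show that the method outperforms existing query recommendation methods.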
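
Contribution (2) describes label propagation that spreads label information from labeled nodes to their neighbors until it stabilizes at the closed-form solution of a cost function, and notes that classical semi-supervised algorithms are special cases. The sketch below illustrates only one such classical special case, the local-and-global-consistency iteration F ← αSF + (1 − α)Y with S = D^{-1/2} W D^{-1/2}, whose fixed point is F* = (1 − α)(I − αS)^{-1} Y. It is not the dissertation's own cost function or algorithm. The toy query-document click graph, the edge weights, the value α = 0.8, and all function names are assumptions made for illustration.

```python
import math


def normalize(weights):
    """Symmetric normalization S = D^{-1/2} W D^{-1/2} of a weight matrix."""
    degree = [sum(row) for row in weights]
    n = len(weights)
    return [
        [weights[i][j] / math.sqrt(degree[i] * degree[j]) if weights[i][j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def step(S, F, Y, alpha):
    """One propagation step: F <- alpha * S F + (1 - alpha) * Y."""
    n, k = len(F), len(F[0])
    return [
        [alpha * sum(S[i][j] * F[j][c] for j in range(n)) + (1 - alpha) * Y[i][c]
         for c in range(k)]
        for i in range(n)
    ]


def propagate(weights, Y, alpha=0.8, iterations=200):
    """Iterate the propagation step from F = Y; return S and the final F."""
    S = normalize(weights)
    F = [row[:] for row in Y]
    for _ in range(iterations):
        F = step(S, F, Y, alpha)
    return S, F


# Hypothetical toy graph: two queries (q0, q1) and three documents (d0..d2).
nodes = ["q0", "q1", "d0", "d1", "d2"]
# (node index, node index, click weight); weights are made up.
edges = [(0, 2, 3.0), (0, 3, 2.0), (1, 3, 1.0), (1, 4, 3.0)]
W = [[0.0] * len(nodes) for _ in nodes]
for i, j, w in edges:
    W[i][j] = W[j][i] = w

# Only q0 (class 0) and d2 (class 1) carry labels.
Y = [[0.0, 0.0] for _ in nodes]
Y[0][0] = 1.0
Y[4][1] = 1.0

S, F = propagate(W, Y)
predicted = [max(range(2), key=lambda c: row[c]) for row in F]

# At the fixed point one more step changes nothing, which is equivalent to
# F = (1 - alpha) * (I - alpha * S)^{-1} * Y.
residual = max(
    abs(a - b)
    for row_a, row_b in zip(F, step(S, F, Y, 0.8))
    for a, b in zip(row_a, row_b)
)
```

On this toy graph the unlabeled query q1 takes the class of d2, to which it is strongly linked, while d0 and d1 follow q0. The residual check confirms that the iteration has reached its fixed point.
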
[Abstract]: This dissertation studies key information retrieval techniques on heterogeneous information networks, including semi-supervised learning, document clustering, cluster label extraction for retrieval results, and query recommendation. The main research work and innovations are as follows:

(1) A framework is proposed for constructing a heterogeneous information network from queries and documents and for semi-supervised learning on it. Feature-based similarity graphs are constructed from the content features of queries and of documents, and a query-document bipartite graph is constructed from the click relationships between queries and documents.

(2) A graph-regularized semi-supervised learning framework is established for high-order heterogeneous information networks containing multiple types of objects and links. In this framework, graph regularization distinguishes the semantics of the different link types, and a closed-form solution of the cost function is obtained.

(3) Two topic propagation models are proposed for text-rich query-document heterogeneous information networks: TP-TS and TP-Unify. TP-TS treats topic modeling and the random walk as two independent processes: a topic model of the text content is first built with probabilistic latent semantic analysis (PLSA), and the topic information is then propagated across the query-document bipartite graph to reveal the topics of the different nodes.

(4) A new method of cluster label extraction is proposed. The basic idea is to recast label extraction as the problem of ranking the query terms related to a cluster, which avoids extracting topic words from the cluster of web documents. An algorithm is proposed that fuses the query-page click graph, the page similarity graph, and the link graph to rank query terms and web pages jointly. The algorithm effectively integrates the assessments of web pages made by users, page creators, and page authors.

(5) Query recommendation based on log analysis is combined with query recommendation based on semantic analysis: a Term-Query-URL heterogeneous information network is constructed to analyze log information and semantic information together, and recommendations are produced by a query-based random walk with restart.
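
The random walk with restart mentioned in item (5) can be sketched in general terms as follows. This is a minimal illustration of the generic technique on a tiny Term-Query-URL graph, not the dissertation's implementation. The node names (prefixed `q:`, `t:`, `u:`), the edge weights, the restart probability of 0.15, and the helper names are all hypothetical.

```python
def build_graph(edges):
    """Build an undirected weighted adjacency map from (a, b, weight) triples."""
    adjacency = {}
    for a, b, weight in edges:
        adjacency.setdefault(a, {})[b] = weight
        adjacency.setdefault(b, {})[a] = weight
    return adjacency


def random_walk_with_restart(adjacency, seed, restart=0.15, iterations=100):
    """Approximate the visiting scores of a walk that jumps back to `seed`
    with probability `restart` at every step (power iteration)."""
    scores = {node: 0.0 for node in adjacency}
    scores[seed] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in adjacency}
        for node, neighbours in adjacency.items():
            total = sum(neighbours.values())
            for neighbour, weight in neighbours.items():
                # Move along an edge with probability proportional to its weight.
                updated[neighbour] += (1 - restart) * scores[node] * weight / total
        updated[seed] += restart  # restart mass returns to the seed query
        scores = updated
    return scores


def recommend(adjacency, seed, top_k=2):
    """Rank the other query nodes by their score relative to the seed query."""
    scores = random_walk_with_restart(adjacency, seed)
    candidates = [n for n in adjacency if n.startswith("q:") and n != seed]
    return sorted(candidates, key=lambda n: scores[n], reverse=True)[:top_k]


# Hypothetical toy data: query-URL weights stand in for click counts,
# query-term weights for term membership.
graph = build_graph([
    ("q:cheap flights", "t:flights", 1.0),
    ("q:cheap flights", "u:travel.example", 4.0),
    ("q:airline tickets", "t:tickets", 1.0),
    ("q:airline tickets", "u:travel.example", 3.0),
    ("q:python tutorial", "t:python", 1.0),
    ("q:python tutorial", "u:docs.example", 5.0),
])

scores = random_walk_with_restart(graph, "q:cheap flights")
suggestions = recommend(graph, "q:cheap flights")
```

In this toy graph the seed query shares a clicked URL with "airline tickets", so that query ranks first, while "python tutorial" lies in a disconnected part of the graph and receives a score of zero. Term nodes give sparse queries a second route to related queries when click data is thin.
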

【Degree-granting institution】: Hunan University
【Degree level】: Doctoral
【Year degree granted】: 2014
【Classification number】: TP391.3

【Similar literature】