Research on Key Technologies of Microblog Search
Published: 2018-07-28 07:39
[Abstract]: Microblogs have rapidly become an important source of real-time information, and microblog search faces two major problems: computing the relevance between a query and microblog messages, and organizing the search results. Relevance computation measures how similar a message is to the query in content and semantics. Result organization arranges messages in a concise, ordered way to overcome redundancy and non-standard writing, mainly through classification and summarization. Taking Twitter as an example, this thesis explores several important problems in microblog search: relevance computation, classification of query results, summarization, and contrastive topic summarization.
For relevance computation, two message ranking models are proposed, one based on learning to rank and one based on a recurrent neural network language model. Compared with the relevance ranking algorithms used in current microblog search services, the former significantly improves the relevance of the ranked message list, while the latter narrows the gap in computing semantic relatedness between messages and improves the coverage of query results. The learning-to-rank model systematically studies the roles of text relevance features, microblog writing features, and author authority features in microblog relevance computation. The model based on the neural network language model introduces semantic similarity into message relevance computation, computing lexical-semantic similarity between messages at the granularity of word vectors.
For search result classification, a collective classification model based on relations between messages is proposed, and a topic taxonomy is defined for microblogs. Compared with a feature-based baseline, the model improves accuracy and F-score by 5.38% and 4.74%, respectively. The model applies two kinds of shared-topic relations between messages to three graph-based collective classification models. It considers both local features and the class distribution of related messages, and it classifies a batch of microblog messages jointly, which reduces the impact of data sparsity and substantially improves the classifier's precision and recall. The iterative classification algorithm using the shared-hashtag (#hashtag) relation gives the best results.
For search result summarization, a timeline-based summarization model built on associative, interactive self-reinforcement is proposed. Compared with a graph-based baseline, the model improves ROUGE-1 by 14% on average. Given the search results for a query, the model partitions them into several subtopics in chronological order, computes each message's importance from its text content, the author's social influence, and text quality, and then ranks and extracts messages by importance and diversity to generate the summary. Experiments show that author social influence and text quality effectively improve the measurement of text importance.
For contrastive topic summarization, an optimization-based contrastive topic summarization model built on relations between messages is proposed. Compared with a baseline based on content similarity, the model improves contrastive-attribute coverage and comparative message-pair accuracy by 14.7% and 11.6%, respectively. The model makes full use of similarity relations between messages and three kinds of shared-topic relations, and uses the PageRank algorithm and the SimRank method to maximize the internal contrast and topic representativeness of message pairs, producing a summary of the commonalities and differences in the search results of the contrasted queries.
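The summarization step in the abstract ranks and extracts messages by importance and diversity. The abstract does not give the scoring function or the selection procedure, so the following is only a minimal sketch of one common way to trade off the two: greedy selection in the style of maximal marginal relevance. The names `jaccard` and `select_summary`, the use of word-overlap similarity as the redundancy measure, the `trade_off` weight, and the assumption that importance scores are already computed are all illustrative and not taken from the thesis.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token sets (assumed redundancy measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_summary(messages, importance, k, trade_off=0.7):
    """Greedily pick up to k messages, balancing importance against redundancy.

    messages:   list of strings
    importance: list of floats, one per message (assumed to be precomputed)
    trade_off:  weight on importance; (1 - trade_off) penalizes similarity
                to messages that were already selected
    """
    tokens = [set(m.lower().split()) for m in messages]
    selected = []                      # indices of chosen messages, in pick order
    candidates = list(range(len(messages)))

    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to any already-chosen message.
            redundancy = max(
                (jaccard(tokens[i], tokens[j]) for j in selected), default=0.0
            )
            return trade_off * importance[i] - (1 - trade_off) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return [messages[i] for i in selected]


if __name__ == "__main__":
    msgs = [
        "storm hits city tonight",
        "storm hits city tonight badly",
        "team wins final match",
    ]
    scores = [0.9, 0.85, 0.5]  # made-up importance values for illustration
    print(select_summary(msgs, scores, k=2, trade_off=0.5))
```

With an equal weighting, the near-duplicate second message is skipped in favor of the less important but novel third one. Setting `trade_off=1.0` reduces the procedure to plain ranking by importance.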
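[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Doctoral
[Year degree conferred]: 2014
[Classification number]: TP393.092; TP391.1
Article ID: 2149383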
Article link: https://www.wllwen.com/guanlilunwen/ydhl/2149383.html