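Research on Microblog Retrieval Based on Semantic Similarity Computation and the Twitter Storm Platform
Published: 2018-07-03 18:22
Topics: microblog + semantic expansion; Source: Wuhan University of Technology, master's thesis, 2014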
[Abstract]: With the rapid development of the Internet in China and abroad, the microblog, a social product used widely around the world, has taken on epoch-making significance. While providing users with an open, centralized social networking service, it has gradually grown into a new medium with considerable influence. Because microblog data are large in scale and arrive in real time, it is particularly important to give users the content they are interested in from a massive, dynamically updated collection of microblogs.

The microblog retrieval discussed in this thesis, based on feature expansion and similarity computation, covers three points: 1. Expanding the content of short microblog texts to enrich their semantic features, so that retrieval results are semantically relevant to the query keywords. 2. Using the network structure of the WordNet machine-readable semantic dictionary to obtain more accurate semantic similarity values between microblogs. 3. Using the similarity value as the ranking criterion to simulate a real-time microblog retrieval process that retrieves microblogs for a keyword and provides a list of related microblogs for each retrieved microblog.

To enrich microblog semantics, the thesis proposes a semantic feature expansion method based on Wikipedia. The method treats the nouns in a microblog as the keywords that express its topic and expands them with related terms to enrich the information content of the microblog. Specifically, Wikipedia serves as the expansion source: the categories listed in the "category" section of each noun's Wikipedia entry are added to the original microblog as expanded semantic features. Experiments show that this expansion method can improve the quality of the similarity results to some extent.

To obtain more accurate microblog similarity values, the thesis uses the network structure of WordNet, the English lexical database developed at Princeton University, to compute similarity based on microblog semantics. Specifically, we use the path-length-based method proposed in [37], which computes semantic similarity from the path lengths (depths) from the WordNet root node to the two words and to their nearest common node. A comparison with VSM-based cosine similarity in the experiments shows that this method can improve the accuracy and recall of finding related microblogs to some extent.

To simulate real-time microblog retrieval, the thesis studies the architecture and application of Twitter Storm, an open-source real-time data processing platform, and uses its local mode to simulate real-time, distributed data processing. Specifically, the thesis defines its own microblog retrieval topology and implements the function of each node in it, including preprocessing of the Twitter dataset, message passing between nodes, parallel similarity computation across multiple nodes and maintenance of the similarity table, ranking of retrieval results by similarity value, and providing related microblogs for each retrieval result. Microblog retrieval and ranking are thereby embedded in the Twitter Storm platform.
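The depth-based similarity described in the abstract can be illustrated with a small sketch. The abstract does not state the formula from [37]; the version below assumes the common Wu-Palmer form, 2 * depth(common) / (depth(a) + depth(b)), and replaces WordNet with a hypothetical six-node taxonomy. The names `PARENT`, `depth_similarity`, and `cosine_similarity` are illustrative and do not come from the thesis.

```python
import math
from collections import Counter

# Hypothetical toy is-a taxonomy (child -> parent); the thesis uses WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
}

def path_to_root(word):
    """Return the chain [word, parent, ..., root]."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path

def depth(word):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(word))

def nearest_common_node(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward from b, so the first hit is deepest
        if node in ancestors_of_a:
            return node
    raise ValueError("words share no common node")

def depth_similarity(a, b):
    """Assumed Wu-Palmer-style score: 2 * depth(common) / (depth(a) + depth(b))."""
    common = nearest_common_node(a, b)
    return 2.0 * depth(common) / (depth(a) + depth(b))

def cosine_similarity(tokens_a, tokens_b):
    """VSM baseline: cosine of raw term-frequency vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In this toy taxonomy, "dog" and "cat" share the parent "animal", so the depth-based score is above zero, while the cosine of the single-token texts ["dog"] and ["cat"] is zero because they share no term. That gap is the kind of difference the comparison with the VSM baseline is meant to expose.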
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092; TP391.3
Article ID: 2094589
Article link: https://www.wllwen.com/guanlilunwen/ydhl/2094589.html