
Research and Design of a Microblog Topic Tracking Method

Published: 2018-09-14 13:47
[Abstract]: The Internet plays an increasingly important role in daily life, and both work and leisure now depend on it. As Internet technology has developed, information platforms such as Twitter have emerged in the United States, and Sina Weibo and Tencent Weibo have emerged in China. On a microblog platform, users publish short messages of up to 140 characters and can repost or comment on posts that interest them. Such a platform lets a newsworthy report spread across the network within minutes, which greatly improves how quickly users can obtain the latest news. In an age of information explosion, however, people can be overwhelmed by the sheer volume of content. A method is therefore urgently needed to consolidate this information so that people can retrieve what they want according to their own needs.

This thesis first studies microblog text representation. Because microblog posts are short, real-time, colloquial, and original, a text representation suited to microblogs is proposed on the basis of the existing vector space model. Before processing, the method filters out posts with fewer than N characters; after word segmentation, all content words are taken as feature words. A T-TFIDF weighting method is also proposed for microblogs, which increases the weight of words that appear in a post's short title. With these improvements, the vector space representation captures microblog content better and assigns each word a weight that reflects its importance in the post.

With microblog text mapped into the vector space, the thesis then proposes an adaptive microblog topic tracking method based on K-means clustering. Given one to four microblog posts supplied by the user, the method tracks the topic in a microblog corpus collected in real time. Each incoming post is compared against the set of subtopic vectors, and the similarity determines whether the post belongs to the topic.
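
The representation and matching steps described above can be illustrated with a minimal Python sketch. The abstract does not give the T-TFIDF formula, so everything concrete here is an assumption: the title is taken to be the text inside 【】 brackets, `TITLE_BOOST` and `MIN_TOKENS` are hypothetical parameters, the length filter counts tokens rather than characters, posts are assumed to be already segmented into whitespace-separated words, and cosine similarity with a fixed threshold stands in for the unspecified similarity test.

```python
import math
import re
from collections import Counter

TITLE_BOOST = 2.0   # hypothetical: extra count given to each title word
MIN_TOKENS = 3      # hypothetical stand-in for the length limit N
TITLE_RE = re.compile(r"【(.*?)】")  # assumed title convention


def split_title(post):
    """Split a pre-segmented post into (title_tokens, body_tokens)."""
    title = " ".join(TITLE_RE.findall(post))
    body = TITLE_RE.sub(" ", post)
    return title.split(), body.split()


def t_tfidf(post, doc_freq, n_docs):
    """Return a sparse {term: weight} vector with title words boosted."""
    title, body = split_title(post)
    if len(title) + len(body) < MIN_TOKENS:
        return {}  # post too short: filtered out
    counts = Counter(body)
    for tok in title:
        counts[tok] += TITLE_BOOST
    total = sum(counts.values())
    vec = {}
    for tok, count in counts.items():
        # Smoothed inverse document frequency, always positive.
        idf = math.log((n_docs + 1) / (doc_freq.get(tok, 0) + 1)) + 1.0
        vec[tok] = (count / total) * idf
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def on_topic(vec, subtopic_vectors, threshold=0.2):
    """A post is on topic if it is close enough to any subtopic vector."""
    return any(cosine(vec, sub) >= threshold for sub in subtopic_vectors)
```

In this sketch the subtopic vectors would initially be the T-TFIDF vectors of the one to four seed posts supplied by the user; the threshold value is illustrative only.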

While tracking, the method adjusts the subtopic vector set dynamically. Whenever a post is judged to belong to the topic, candidate words are selected and their frequencies are counted. If a word's frequency exceeds a threshold, a new subtopic is deemed to have appeared; the tracked posts are then clustered with K-means, and the subtopic vector set is adjusted according to the clustering result. The subtopic vector set thus evolves with the tracked posts, so the topic can continue to be tracked more accurately.

In addition, the thesis studies the application of automatic summarization to microblogs. The tracked posts are first clustered using the subtopic vector set as the initial cluster centers. Sentence weights are then computed, and the highest-weighted sentence in each cluster is selected as that cluster's summary sentence. Finally, the selected sentences are sorted chronologically to produce the topic summary.

This work was supported by the National Natural Science Foundation of China (No. 61172072, 61271308), the Beijing Natural Science Foundation (No. 4112045), the Doctoral Program Foundation of Higher Education (No. W11C100030), the Beijing Science and Technology Program (No. Z121100000312024), and the Discipline Construction and Graduate Education Construction Project of the Beijing Municipal Education Commission.
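
The adaptive update and summarization steps outlined in the abstract can be sketched roughly as follows. This is not the thesis's implementation: the abstract does not define how candidate words are chosen, which distance K-means uses, or how sentence weights are computed. The sketch assumes that candidates are terms absent from the known subtopic vocabulary, that frequency means the number of tracked posts containing the term, that clusters are assigned by cosine similarity (a K-means variant), and that a post's weight is the sum of its term weights. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def new_subtopic_words(tracked_vectors, known_terms, threshold):
    """Terms outside the known vocabulary found in more than `threshold` posts."""
    counts = Counter(
        tok for vec in tracked_vectors for tok in vec if tok not in known_terms
    )
    return {tok for tok, count in counts.items() if count > threshold}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for tok, w in vec.items():
            total[tok] = total.get(tok, 0.0) + w
    return {tok: w / len(vectors) for tok, w in total.items()}


def kmeans(vectors, centers, iterations=10):
    """Cluster sparse vectors, starting from the supplied initial centers."""
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its most similar center.
        labels = [
            max(range(len(centers)), key=lambda k: cosine(vec, centers[k]))
            for vec in vectors
        ]
        # Update step: recompute each center; keep empty clusters unchanged.
        new_centers = []
        for k, old in enumerate(centers):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            new_centers.append(mean_vector(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def summarize(posts, labels):
    """posts: list of (timestamp, text, vector). One post per cluster, by time."""
    best = {}
    for post, lab in zip(posts, labels):
        score = sum(post[2].values())  # assumed weight: sum of term weights
        if lab not in best or score > best[lab][0]:
            best[lab] = (score, post)
    chosen = sorted((post for _, post in best.values()), key=lambda p: p[0])
    return [text for _, text, _ in chosen]
```

Under these assumptions, a tracker would call `new_subtopic_words` after each accepted post, rerun `kmeans` with the current subtopic vectors as initial centers when the result is non-empty, and use the returned centers as the updated subtopic vector set.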
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.092; TP391.1



