
Research on Automatic Summary Generation for Private Weibo

Posted: 2018-06-15 16:37

  Topic: private Weibo + automatic summarization; Source: master's thesis, Inner Mongolia University of Science and Technology, 2014


【Abstract】: Since 2007, microblogging has swept the globe as a form of communication. Weibo's low barrier to entry, timely interaction, and convenient publishing have driven its adoption and growth worldwide; in recent years its momentum has made it an indispensable part of daily life. In China, the number of Weibo users has surged, with hundreds of millions of posts published every day, producing massive volumes of Weibo data. Most posts are informal and heavily colloquial, and carry many comments. Finding, within this vast and varied data, the posts that match a user's interests and provide useful information has become a major problem accompanying Weibo's growth.

This thesis uses Sina Weibo as its data source and takes as its unit of study all posts published by a single personal account over a historical time period. Building on a study of automatic summarization techniques and the characteristics of Weibo data, and on an examination of text representation, clustering algorithms, and related topics, a complete system is designed and implemented, from data acquisition through data processing to final summary generation. The pipeline comprises the following steps: data acquisition, preprocessing, text representation, feature selection, an improved similarity computation, an improved clustering algorithm and its implementation, and generation of the combined summary. The main work of this thesis is:

First, raw Weibo data is collected through the Sina Weibo open platform.

Second, the data is analyzed, and, reflecting the characteristics of private Weibo text, each post is merged with its comments into a pseudo-document before word segmentation and other preprocessing. The segmented text is then converted into a data format: the text model turns the data from text into a mathematical representation that captures the relationships between documents, on top of which a text-similarity measure is computed.

Then, K-means is adopted for clustering. Specifying the value of K has always been K-means's biggest weakness and is usually judged from experience; the choice of initial centers is another significant problem, since the centers should be representative and different choices strongly affect the accuracy of the results. We improve the algorithm so that it adaptively determines K and selects the centers.

Finally, based on the timeliness and popularity of each post's content, weights are assigned to the posts within each cluster; a summary is first extracted from each cluster, and the per-cluster summaries are then combined into the final summary of the private Weibo account. The thesis closes with experiments that analyze and validate the proposed clustering improvements, showing higher accuracy and applicability than the original algorithm. The complete system realizes summary generation for private Weibo.
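The abstract does not spell out the text-representation and similarity steps. A minimal sketch, assuming the standard TF-IDF vector space model with cosine similarity (the usual choice for this kind of pipeline, not necessarily the thesis's exact formulation; the toy documents below are invented for illustration):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors for pre-segmented documents.

    Each document is a list of tokens, e.g. the output of a Chinese
    word segmenter run over a post-plus-comments pseudo-document.
    """
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term->weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy pseudo-documents, already segmented; real input would come from
# the preprocessing stage described above.
docs = [["微博", "摘要", "聚类"],
        ["微博", "评论", "聚类"],
        ["天气", "旅行"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # overlapping vocabulary -> positive
print(cosine(vecs[0], vecs[2]))  # disjoint vocabulary -> 0.0
```

Terms that occur in every document receive an IDF of zero, so they drop out of the similarity, which is desirable for ubiquitous Weibo filler words.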
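The abstract states that the improved algorithm adaptively determines K and selects representative centers, but gives no mechanism. A minimal illustrative sketch, assuming farthest-point seeding for the centers and silhouette-based selection of K — common techniques standing in for the thesis's unspecified improvement:

```python
import math

def dist(p, q):
    return math.dist(p, q)

def seed_centers(points, k):
    """Farthest-point seeding: repeatedly add the point farthest from all
    chosen centers, so the initial centers are spread out and representative
    (an assumed stand-in for the thesis's center-selection improvement)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=20):
    """Plain K-means with farthest-point seeding; returns the clusters."""
    centers = seed_centers(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, centers[i]))].append(p)
        centers = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

def silhouette(clusters):
    """Mean silhouette coefficient: high when clusters are compact and
    well separated. dist(p, p) == 0, so summing over the whole cluster
    and dividing by len(c) - 1 gives the mean distance to the others."""
    scores = []
    for i, c in enumerate(clusters):
        for p in c:
            if len(c) == 1:
                scores.append(0.0)
                continue
            a = sum(dist(p, q) for q in c) / (len(c) - 1)
            b = min(sum(dist(p, q) for q in o) / len(o)
                    for j, o in enumerate(clusters) if j != i and o)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def adaptive_kmeans(points, k_max=6):
    """Adaptively pick the K in 2..k_max whose clustering maximizes
    the silhouette score."""
    best_k, best_clusters, best_score = None, None, -2.0
    for k in range(2, min(k_max, len(points) - 1) + 1):
        clusters = kmeans(points, k)
        s = silhouette(clusters)
        if s > best_score:
            best_k, best_clusters, best_score = k, clusters, s
    return best_k, best_clusters

# Three well-separated toy groups standing in for document vectors.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
k, clusters = adaptive_kmeans(points)
print(k)                                 # the three groups are recovered
print(sorted(len(c) for c in clusters))
```

On this toy data the seeding is deterministic and the silhouette peaks at K = 3, so no K has to be supplied by hand, which mirrors the adaptivity the thesis claims for its improved K-means.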
【Degree-granting institution】: Inner Mongolia University of Science and Technology
【Degree level】: Master
【Year conferred】: 2014
【CLC number】: TP393.092



