Weibo Text Classification and Commercial Keyword Extraction Based on Hadoop

Published: 2019-02-19 18:49

[Abstract]: With the rapid development of computer and network technology, Weibo has become a major new medium in China. The rapid growth of its user base, together with the level-by-level propagation of information through reposts, has pushed the volume of Weibo data to an unprecedented scale. Faced with the massive text data generated by this growth, accurately and efficiently finding the commercially valuable information it contains, and thereby improving the precision of advertising, has become a major goal of data processing on every Weibo platform. This thesis studies how to discover and extract commercial keywords from massive Weibo text data. To make the extraction more targeted, the Weibo data are first classified by text category, which both reduces the amount of data handled in a single pass and allows texts of the same category to be processed together. The keywords of the texts in each category are then reweighted using their search weights in Internet search engines, which improves the precision of commercial keyword extraction. Because Weibo text is large in total volume, short in each individual post, and highly informal in content, traditional classification methods and commercial information extraction algorithms have limitations when applied to it. Since a single post is short, contains few effective features, and is colloquial, the thesis expands the feature words of each text with similar words and collocated words to reduce the chance of losing features. Taking into account the large number of posts and the informality of their content, it proposes a Weibo text classification method based on the category dispersion of feature words and the degree of that dispersion.
Considering Weibo-specific factors such as repost counts, comment counts, and the massive scale of the data, the thesis also modifies the traditional TF-IDF algorithm. The improved TF-IDF is run on the Hadoop cloud computing platform, with all posts of a single user treated as one unit of computation, and the resulting weights are then adjusted by each word's search weight in Internet search engines, so that commercially valuable keywords can be extracted from massive data. Experiments show that the classification method performs well on Weibo posts, and that the keyword extraction algorithm, which combines the improved TF-IDF weight with the Internet search weight, is well suited to Weibo data processing and gives good commercial results. Running on the cloud computing platform also improves processing efficiency to some extent, making the processing of massive Weibo data sets feasible.
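The abstract does not state the formula of the improved TF-IDF, so the following is only a minimal single-process sketch of the general idea, under stated assumptions: each user's posts are merged into one document, each post's terms are weighted by a log-damped function of its repost and comment counts, a smoothed IDF is computed across users, and the result is multiplied by an externally supplied search weight. The function names, the `engagement_weight` form, the smoothing, and the `search_weight` table are all hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def engagement_weight(reposts, comments):
    # ASSUMPTION: a log-damped boost so that heavily reposted or commented
    # posts count more without dominating. The thesis formula is not given.
    return 1.0 + math.log1p(reposts + comments)


def user_term_frequencies(posts):
    """Merge one user's posts into a single engagement-weighted TF table.

    posts: list of (tokens, reposts, comments) tuples, tokens already segmented.
    Returns {term: normalized weighted frequency}.
    """
    weighted = Counter()
    for tokens, reposts, comments in posts:
        w = engagement_weight(reposts, comments)
        for tok in tokens:
            weighted[tok] += w
    total = sum(weighted.values())
    return {t: c / total for t, c in weighted.items()} if total else {}


def commercial_keyword_scores(users, search_weight, default_weight=1.0):
    """Score terms per user as TF * smoothed IDF * search weight.

    users: {user_id: posts}; search_weight: {term: weight} (hypothetical table
    standing in for weights derived from a search engine).
    """
    tf = {uid: user_term_frequencies(posts) for uid, posts in users.items()}

    # Document frequency: in how many users' merged documents a term appears.
    df = Counter()
    for freqs in tf.values():
        df.update(freqs.keys())

    n_users = len(tf)
    scores = {}
    for uid, freqs in tf.items():
        scores[uid] = {
            term: f
            * (math.log((1 + n_users) / (1 + df[term])) + 1.0)  # smoothed IDF
            * search_weight.get(term, default_weight)
            for term, f in freqs.items()
        }
    return scores


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    users = {
        "u1": [(["phone", "sale", "today"], 10, 5), (["phone", "weather"], 0, 0)],
        "u2": [(["weather", "today"], 0, 0)],
    }
    search_weight = {"phone": 2.0, "sale": 1.5}
    for uid, term_scores in commercial_keyword_scores(users, search_weight).items():
        ranked = sorted(term_scores, key=term_scores.get, reverse=True)
        print(uid, ranked[:3])
```

In a Hadoop setting the per-user frequency table would naturally be produced in a map step keyed by user, and the document frequencies in a reduce step keyed by term; that mapping is a plausible reading of the abstract, not a description of the thesis's actual job layout.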
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP393.092;TP391.1
