Design and Implementation of an Enterprise News Information Classification Subsystem in a Distributed Environment

Published: 2018-08-27 09:03

[Abstract]: In recent years, with the rapid development of the Internet, news of every kind has appeared in an endless stream, and news information plays an increasingly important role in people's cultural and daily lives. The main problem studied in this thesis is how to collect and organize large volumes of news data and surface the news that people actually want to find. Common search engines return too many news items, many of them only weakly related to the topic. To address this, the thesis proposes and designs an enterprise-oriented news classification subsystem that provides news collection, information processing and news display, so that enterprise users can quickly and accurately obtain news related to their industry. First, a web crawler module is designed. The crawler is written using a breadth-first algorithm and efficiently collects and identifies the news that an enterprise is interested in. Second, a text classification module is designed and implemented, in which a distributed Bayesian algorithm classifies the news texts. During classification, text preprocessing, feature selection and vectorization require a large amount of computation, and model training suffers from long training times and limited database storage capacity. To solve these problems, a Hadoop distributed computing platform is built, the MapReduce parallel computing model is used to process the different stages of text classification in a distributed, parallel manner, and a Hive data warehouse is established to handle the large storage footprint. When a large amount of new data arrives, the traditional Bayesian method has to relearn all previous sample data, which is both time-consuming and cumbersome. For this situation, the thesis draws on traditional incremental learning methods and designs and implements an incremental Bayesian algorithm, which does not retrain on the earlier data and only needs to revise the existing data. Finally, a classification subsystem for enterprise news information is designed. Its workflow comprises information collection, text preprocessing, feature extraction, classifier construction, classification performance evaluation and incremental learning, and the functions of several of its modules are tested. The system obtains news with the crawler and classifies it in a Hadoop environment. Tests show that, on large-scale news data, the incremental classifier running on Hadoop improves accuracy by about 4% over the traditional Bayesian classifier and exhibits good execution efficiency and high scalability. The thesis provides an implementation scheme for classifying online news text that can serve as a reference for text classification in other fields.
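The incremental idea described in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not give the thesis's actual algorithm, so everything here is an assumption for illustration: the class name `IncrementalNaiveBayes`, the multinomial model, the Laplace smoothing constant `alpha`, and the toy data are hypothetical, and the Hadoop/MapReduce distribution is not shown. The point is only that a classifier which stores raw counts can absorb a new batch by adding to those counts, without revisiting earlier samples.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Hypothetical multinomial naive Bayes that keeps raw counts,
    so a new batch is folded in by incrementing them."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing (assumed value)
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # per-class word counts
        self.class_tokens = Counter()            # total tokens per class
        self.vocab = set()                       # all words seen so far

    def update(self, docs, labels):
        """Add a batch of tokenized documents; earlier batches are not re-read."""
        for tokens, label in zip(docs, labels):
            self.class_docs[label] += 1
            self.word_counts[label].update(tokens)
            self.class_tokens[label] += len(tokens)
            self.vocab.update(tokens)

    def log_scores(self, tokens):
        """Return the unnormalized log posterior of each class for one document."""
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)
            denom = self.class_tokens[label] + self.alpha * vocab_size
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy data (made up for this sketch): two batches arriving at different times.
batch1 = [["bank", "loan", "rate"], ["match", "goal", "team"]]
labels1 = ["finance", "sports"]
batch2 = [["stock", "rate", "market"], ["team", "coach", "season"]]
labels2 = ["finance", "sports"]

# Incremental training: the second batch only adds to the stored counts.
incremental = IncrementalNaiveBayes()
incremental.update(batch1, labels1)
incremental.update(batch2, labels2)

# Full retraining on all data at once, for comparison.
full = IncrementalNaiveBayes()
full.update(batch1 + batch2, labels1 + labels2)

query = ["rate", "market"]
prediction = incremental.predict(query)
```

In this sketch the incrementally updated model and the fully retrained one hold identical counts, so they score any document identically. Because the update only adds counts, the same counting step would in principle map onto a MapReduce job, though how the thesis actually distributes it is not stated in the abstract.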
[Degree-granting institution]: Yanbian University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP311.13;TP391.1
