Design and Implementation of a Spark-Based News Web Page Classification System

Published: 2018-05-03 10:49

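Topics: web page classification + web page structure information; Source: Beijing University of Posts and Telecommunications, 2017 master's thesis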

[Abstract]: The Internet is developing rapidly. It has become a mature, enormous system whose information is both vast in volume and highly timely, and these strengths make us increasingly dependent on it for information about the outside world. However, because the Internet is open and heterogeneous, online information is complex and disordered, and it is hard to accurately find what one needs in such a large and irregular body of information. In addition, users often want to filter out certain categories of web pages. Web page classification is an effective way to address these problems: it organizes and processes web pages on the Internet in a unified way, making access more convenient for users and resource use more efficient. This thesis studies the full pipeline of traditional web page classification in some depth, analyzing web page information extraction, feature selection, feature weight calculation, and classification methods. On this basis, the main contributions are: 1) To address the neglect of text-level semantic information in earlier web page classification methods, a topic model is introduced and a classification method combining the vector space model with the topic model is proposed. The improved and traditional methods are compared on the same data set, and the results show that introducing the LDA model improves classification on every category. 2) To address the neglect of web page structure information, TF-IDF is improved using that structure information. The same data set is vectorized with both traditional and improved TF-IDF and classified with the same SVM method; the results show that taking page structure into account improves classification. 3) To address the treatment of web pages as isolated objects, with no regard for the relationships between them, the random forest method is improved using inter-page relationship information; experiments show that the improved random forest classifies better than the original. 4) Building on this theoretical work, a Spark-based web page classification system is implemented, whose main modules are a web crawling module, a web page preprocessing module, and a web page classification module.
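The abstract does not give the formula used to fold page structure into TF-IDF. As a rough illustration of the general idea only, the sketch below weights each term occurrence by the page region it appears in before computing TF-IDF. The region names, the `FIELD_WEIGHTS` values, the smoothed IDF form, and the sample documents are all assumptions made for this example and are not taken from the thesis; the input is assumed to be already tokenized.

```python
import math
from collections import Counter

# Hypothetical region weights (an assumption, not values from the thesis):
# terms in the title count more than terms in headings or body text.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_tf(doc):
    """Return structure-weighted, normalized term frequencies.

    `doc` maps a region name (e.g. "title") to its list of tokens.
    """
    tf = Counter()
    for field, tokens in doc.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)  # unknown regions count as body
        for token in tokens:
            tf[token] += weight
    total = sum(tf.values())
    return {t: c / total for t, c in tf.items()} if total else {}


def idf(docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document, regardless of region.
        df.update({tok for tokens in doc.values() for tok in tokens})
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    idf_map = idf(docs)
    return [
        {t: tf * idf_map[t] for t, tf in weighted_tf(doc).items()}
        for doc in docs
    ]


# Made-up sample pages, for illustration only.
docs = [
    {"title": ["spark", "news"], "body": ["news", "cluster", "data"]},
    {"title": ["football"], "body": ["news", "match", "goal"]},
]
vectors = tfidf_vectors(docs)
```

In the first sample page, "spark" (one title occurrence) ends up with three times the weight of "cluster" (one body occurrence), since both appear in the same number of documents. The resulting sparse vectors could then be passed to any classifier, such as the SVM mentioned in the abstract.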
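[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP393.092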

[References]