Research on Topic Web Crawlers for Web Text Mining (面向Web文本挖掘的主题网络爬虫研究)

Published: 2018-11-25 07:24

[Abstract]: With the arrival of the Web 3.0 era, the number and complexity of Web pages on the Internet are growing explosively, and the information those pages contain is growing geometrically along with them. The information in a Web page is usually carried by its text, so Web text data holds a wealth of knowledge and rules that are valuable to users. However, because Web text data is semi-structured, real-time, and discrete, users find it hard to obtain the knowledge they need directly from such a complex data set. How to effectively mine the information and knowledge that users actually care about from massive Web text data, and present it in a form they can understand, is therefore a very active research topic. This thesis addresses two aspects, acquiring Web text data and analyzing it, and studies how to obtain the Web text information users need accurately and efficiently and how to mine the valuable knowledge it contains. The specific research work is as follows.
Topic web crawler: The thesis first analyzes the principles and structure of existing topic web crawler implementations, then introduces the classification of topic web crawlers and selects the functional topic web crawler as the focus of the study. Finally, it analyzes implementation languages for web crawlers and chooses the emerging Node.js platform to implement a topic web crawler targeting topic network communities.
Web text representation model: The thesis first analyzes existing text representation models. Then, because the Web text data it deals with is mainly short text, it combines keyword extraction and word vector representation techniques from natural language processing to propose a text representation model based on keyword vectors.
Web text clustering algorithm: The thesis first defines Web text mining, then describes clustering techniques in Web text mining in detail. After analyzing the categories of Web text clustering algorithms, it selects BIRCH as its Web text clustering algorithm, analyzes the shortcomings of BIRCH, and proposes a new Web text clustering algorithm.
Building on the above, the thesis combines its results on Web text mining and topic web crawling to design and implement an information acquisition and analysis system for topic network communities.
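
For context on the BIRCH algorithm named in the abstract, the following is a minimal sketch of the standard clustering feature (CF) summary that BIRCH maintains for each sub-cluster. It illustrates textbook BIRCH only, not the improved algorithm the thesis proposes, whose details the abstract does not give. The class name `ClusteringFeature`, the helper `absorb`, and the `threshold` parameter are illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClusteringFeature:
    """Standard BIRCH summary of a sub-cluster: the triple (N, LS, SS)."""
    n: int            # number of points in the sub-cluster
    ls: List[float]   # per-dimension linear sum of the points
    ss: float         # sum of squared norms of the points

    @classmethod
    def from_point(cls, point: List[float]) -> "ClusteringFeature":
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # CF additivity: merging two sub-clusters adds their summaries.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self) -> List[float]:
        return [x / self.n for x in self.ls]

    def radius(self) -> float:
        # Root-mean-square distance of the points from the centroid,
        # computed from the summary alone (no raw points needed).
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


def absorb(cf: ClusteringFeature, point: List[float],
           threshold: float) -> Optional[ClusteringFeature]:
    """Hypothetical helper: return the merged CF if the point fits
    within the radius threshold, otherwise None."""
    candidate = cf.merge(ClusteringFeature.from_point(point))
    return candidate if candidate.radius() <= threshold else None
```

Because the summary is additive, a document vector can be absorbed into a sub-cluster in one pass without storing the raw points. This is the property that makes BIRCH a candidate for large Web text collections.
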
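[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1; TP393.09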

[References]