Research on Topic Web Crawlers for Web Text Mining
[Abstract]: With the advent of the Web 3.0 era, the number and complexity of Web pages on the Internet have grown explosively, and the information they contain has grown geometrically. Most of that information is carried by the text of the page, so Web text data holds a wealth of knowledge and patterns that are valuable to users. However, because Web text data is semi-structured, real-time, and discrete, users find it difficult to obtain the knowledge they need directly from such a complex data set. How to effectively mine the information and knowledge that users really care about from massive Web data, and present it in a form they can understand, is therefore an active research topic. This thesis approaches the problem from two sides, acquiring Web text data and analyzing it, and studies how to obtain the Web text that users need accurately and efficiently and how to mine valuable knowledge from it. The specific work is as follows. First, the topic web crawler: the principles and architecture of topic web crawlers are analyzed, the classification of topic web crawlers is introduced, and the functional topic web crawler is selected as the focus of this study. Implementation languages for web crawlers are then compared, and Node.js is chosen as a relatively new implementation platform for the topic web crawler. Second, a Web text representation model for topic network communities is implemented. Existing text representation models are surveyed; then, because the Web text data in this thesis is mainly short text, keyword extraction and word vector representation techniques from natural language processing are combined to propose a text representation model based on keyword vectors. Third, the Web text clustering algorithm: the definition of Web text mining is introduced, followed by a detailed account of clustering in Web text mining. After the classification of Web text clustering algorithms is analyzed, BIRCH is selected as the Web text clustering algorithm for this thesis; its shortcomings are then analyzed and a new Web text clustering algorithm is proposed. Finally, building on this work, the thesis designs and implements an information acquisition and analysis system for topic network communities, combining the results on Web text mining and topic web crawler technology.
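[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1; TP393.09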