Design and Implementation of a Topic Crawler for the Mining Equipment Domain
Published: 2018-10-18 08:45
[Abstract]: With the rapid development of society and Internet technology, people are gradually shifting from traditional channels of obtaining information to Internet search engines. Amid the vast amount of information on the web, attention has turned to topic search engines, which can quickly retrieve accurate and useful information on a particular subject. A topic search engine serves one specific industry, and the topic crawler is one of its key components: the crawler's efficiency and the accuracy of the information it collects directly affect the quality of search results. A good topic crawler can crawl useful information from the Internet quickly and accurately. This thesis takes the topic crawler as its subject and analyzes and studies the related technologies, with the aim of building a topic crawler system for the mining equipment domain.
The thesis introduces the structure, principles, and development of search engines, as well as the search strategies and working principles of web crawlers. Following the workflow of a web crawler, it studies and analyzes the key technologies of topic crawlers: it gives a detailed design of a keyword-based topic representation method; surveys and classifies methods for web page noise removal and web page deduplication; studies and designs link extraction and content extraction, the key technical points of page information extraction in the system; summarizes the advantages and disadvantages of three word segmentation methods; and, for computing text similarity, focuses on the vector space model and the PageRank algorithm, where the vector space model computation involves term weighting and feature selection.
The thesis covers the whole process of implementing the topic crawler system for the mining equipment domain. Based on an analysis of the theory of topic crawlers, it designs the workflow and structure of the crawler system, selects the initial URLs according to the system design requirements, and designs the system's database. The classical vector space model algorithm is introduced into the system's relevance computation to improve precision. The thesis also describes the relevant implementation details and shows the system's interfaces at run time. The result is a working topic crawler system for the mining equipment domain.
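The vector space model relevance computation mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the smoothed IDF formula, and the sample tokens are all assumptions made for illustration, and the input is assumed to be already segmented into words (the thesis discusses word segmentation separately). Each text becomes a TF-IDF weighted vector, and a page's relevance to the topic is the cosine of the angle between the two vectors.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (dicts).

    The smoothed IDF used here, log((1 + n) / (1 + df)) + 1, is one common
    variant chosen for this sketch; the thesis may weight terms differently.
    """
    n = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty input
        vectors.append({
            term: (count / total)
            * (math.log((1 + n) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance(topic_tokens, page_tokens):
    """Score a page against the topic description (hypothetical helper)."""
    topic_vec, page_vec = tf_idf_vectors([topic_tokens, page_tokens])
    return cosine(topic_vec, page_vec)


# Hypothetical sample data: a keyword-based topic and two candidate pages.
topic = ["mining", "equipment", "crusher", "excavator"]
on_topic_page = ["crusher", "mining", "equipment", "price"]
off_topic_page = ["football", "league", "scores"]

print(relevance(topic, on_topic_page))
print(relevance(topic, off_topic_page))
```

A crawler built along these lines would typically keep a page, and follow its links, only when the score exceeds a chosen threshold; the threshold value is a tuning decision that the abstract does not specify.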
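[Degree-granting institution]: Hebei University of Engineering
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3
Article ID: 2278602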
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2278602.html