Research on Website Information Collection Technology Based on Web Crawlers

Published: 2018-05-19 07:55

Topic: information collection + information extraction; Source: Dalian Maritime University, master's thesis, 2014

[Abstract]: With its rapid spread and development, the Internet has gradually become part of every aspect of daily life. The Web is an important channel through which people communicate with one another and obtain information from the outside world. As a valuable information source with an intuitive, convenient interface and rich means of expression, the Web can provide users with information in many forms, such as text, audio, and video. Over time, both the volume of information on the Internet and the size of its user base have grown rapidly. Users' needs are becoming increasingly diverse, and quickly providing users with the information they are interested in is currently a major challenge. Self-media has begun to emerge on the Internet and continues to grow in scale; it includes outstanding representatives from many industries and is therefore attracting more and more attention. This thesis therefore proposes using technical means to collect the article content of the self-media platform Baidu Baijia, and then reorganizing the collected articles to facilitate their secondary use. To this end, the thesis presents the design and implementation of an integrated scheme for website information collection based on web crawlers. The scheme consists of three parts: information collection, information extraction, and information retrieval. Information collection is implemented as an extension of the Heritrix crawler (combined with HtmlUnit) and is responsible for collecting the web pages of the target site. Information extraction is implemented with Jsoup and DOM techniques and is responsible for extracting article information from the web pages and saving it to a database, converting unstructured information into structured information. Information retrieval is implemented with the Lucene indexing tool and the SSH2 architecture and is responsible for presenting the collected article information to users so that it is easy to browse.
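The extraction and retrieval stages described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it uses only the Python standard library as a stand-in for Jsoup (extraction) and Lucene (indexing), and it assumes a hypothetical page layout in which the article title is in an `<h1>` element and the body is in `<p>` elements. The names `Article`, `ArticleParser`, `extract_article`, `build_index`, and `search` are invented for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Article:
    """Structured record produced by the extraction stage."""
    url: str
    title: str
    body: str


class ArticleParser(HTMLParser):
    """Collects <h1> text as the title and <p> text as the body.

    The tag choice is an assumption about the page layout, not taken
    from the thesis.
    """

    def __init__(self):
        super().__init__()
        self._tag = None          # the h1/p element currently open, if any
        self.title_parts = []
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            if tag == "p":
                self.body_parts.append(" ")  # keep paragraphs separated
            self._tag = None

    def handle_data(self, data):
        if self._tag == "h1":
            self.title_parts.append(data)
        elif self._tag == "p":
            self.body_parts.append(data)


def extract_article(url, html):
    """Turn raw (unstructured) HTML into a structured Article record."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup.
    title = " ".join("".join(parser.title_parts).split())
    body = " ".join("".join(parser.body_parts).split())
    return Article(url, title, body)


def build_index(articles):
    """Build a toy inverted index: lowercase token -> set of article URLs."""
    index = defaultdict(set)
    for article in articles:
        text = f"{article.title} {article.body}".lower()
        for token in re.findall(r"\w+", text):
            index[token].add(article.url)
    return index


def search(index, term):
    """Return the sorted URLs of articles containing the term."""
    return sorted(index.get(term.lower(), set()))
```

In a full system the `Article` records would be written to a database and the index would be maintained by a search library; the sketch keeps both in memory to show only the data flow from page to structured record to searchable index.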
[Degree-granting institution]: Dalian Maritime University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092

[References]