Design and Implementation of a Web Crawler-Based Information Collection and Classification System

Published: 2018-07-29 10:45

[Abstract]: Now that the Internet reaches every corner of the world, the volume of online information keeps growing. Every day the Internet produces a large amount of data covering all kinds of unfolding events and touching nearly every aspect of people's work and daily life. This data contains much that is valuable, but most of it is of no interest to us, so how to extract the valuable data from such a mass of information is a question that urgently needs attention. The system uses web spider (crawler) technology to build an Internet collection system around practical requirements. Following a directed collection approach, it quickly locates and collects Internet data that meets business needs, then applies text clustering to the collected results to group them into data sets that satisfy specified characteristics, providing data support for downstream business. The system is written in Java with an object-oriented design, uses Lucene search engine technology for underlying data retrieval, uses the open-source Chinese word segmenter IK, and builds its application layer on the classic SSH Web development framework, presenting a simple Internet information collection and classification system. It can provide preliminary filtering and clustering of the required data for individuals, enterprises, or governments with Internet data analysis needs, and supply first-stage standardized data for the data analysis of various complex businesses, which makes it useful in today's data era.
[Degree-granting institution]: Xiamen University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP311.52

[References]