Research and Implementation of Intelligent Web Page Information Collection and Classification

Published: 2018-04-23 01:28

Topics: information collection + information extraction; Source: Hebei University of Technology, 2014 master's thesis

[Abstract]: With the rapid development of science and technology, we have entered the digital information age. The Internet, as the world's largest information repository, has become the primary means by which people obtain information. Because information resources on the network are massive, dynamic, heterogeneous, and semi-structured, and lack unified organization and management, quickly and accurately finding the information one needs among them has become a pressing problem for network users. Web-based collection and classification of network information has therefore become a research hotspot. The goal of traditional Web information collection is to gather as many pages as possible, even all the resources on the Web, and in doing so it pays little attention to the order of collection or to the topics of the collected pages. As a result, the collected content is overly cluttered, which wastes considerable system and network resources. Effective collection methods are therefore needed to reduce clutter and duplication among the collected pages. Reducing this disorder to a large extent, and helping users accurately locate the information they need, cannot realistically be achieved by manual classification alone. Automatic web page classification is therefore an effective means of organizing and managing information, and it is an important part of this study.
This thesis first introduces the background of the subject, the significance of the research, and the current state of research at home and abroad, and then describes the relevant theories, main techniques, and algorithms of web page collection and classification, including web crawling, web page deduplication, Chinese word segmentation, feature extraction, and web page classification. On this basis, an intelligent web information collection and classification system is designed, consisting mainly of two parts: information collection and information classification. The information collection part mainly uses a web crawler based on a topic-based breadth-first strategy, together with rule-template-based web page information extraction, to convert free-form or semi-structured data into structured data; it also applies database-based deduplication and publication-time deduplication to remove duplicate information. The information classification part, according to the needs of users, classifies the information with an SVM algorithm combined with word segmentation and feature extraction, providing users with comprehensive information services.
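The collection step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the in-memory `PAGES` graph, the `TOPIC_TERMS` vocabulary, the keyword-overlap relevance test, and the in-memory hash set (standing in for the database-based deduplication) are all assumptions made for illustration.

```python
from collections import deque
import hashlib

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a": ("svm text classification tutorial", ["b", "c"]),
    "b": ("football match results", ["d"]),
    "c": ("web page classification with svm", ["d", "e"]),
    "d": ("svm text classification tutorial", []),  # same content as "a"
    "e": ("feature extraction for text classification", []),
}

# Assumed topic vocabulary; a real system would derive this from the topic model.
TOPIC_TERMS = {"svm", "classification", "extraction"}


def is_on_topic(text, min_hits=1):
    """Crude relevance test: count topic terms that occur in the page text."""
    words = set(text.lower().split())
    return len(words & TOPIC_TERMS) >= min_hits


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl that follows links only from on-topic pages
    and skips pages whose content has already been stored."""
    queue = deque([seed])
    seen_urls = {seed}     # URL-level deduplication
    seen_hashes = set()    # content-level deduplication (stand-in for a database)
    kept = []
    while queue and len(kept) < max_pages:
        url = queue.popleft()  # FIFO order gives breadth-first traversal
        text, links = fetch(url)
        if not is_on_topic(text):
            continue  # off-topic page: neither stored nor expanded
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            kept.append(url)
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return kept


result = crawl("a", PAGES.__getitem__)
print(result)
```

In this toy graph, page "b" is dropped as off-topic and page "d" is dropped because its content duplicates page "a". Exact-hash comparison only catches identical text; near-duplicate detection would need a different technique.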
[Degree-granting institution]: Hebei University of Technology
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092

[Similar Literature]