Research on a Topic-Specific Information Collection Method Based on Ontology Evolution

Published: 2019-02-08 18:56

[Abstract]: The Internet has given people a new channel for obtaining information. While this information source grows explosively, people face the problem of how to retrieve information on a specific topic from it quickly and accurately. General-purpose search engines are currently the most widely used information retrieval tools, but because they serve the general public, they struggle to deliver topic-specific information in a timely and accurate way. Topic-oriented information collection has therefore become an active research area. This thesis first briefly reviews the state of research, in China and abroad, on topic-specific information collection and ontology evolution, introduces the basic principles, architecture, and main development directions of web information collection, and surveys the theory of text similarity computation and of ontologies. It then designs a collection strategy for each of several kinds of Internet information source: full-site traversal of target websites, targeted tracking of target sections, and scheduled incremental updates of RSS feeds. Next, it designs a topic ontology evolution scheme covering web page content extraction, extraction of feature words from the main text, construction of the initial topic ontology, and evolution of the topic ontology. Finally, an experimental system is designed and implemented, a sample topic is selected, an initial topic ontology is built, and the proposed method is validated experimentally. The main contributions are: (1) collection strategies tailored to different information sources, so that the collector can cope with the complex collection environment of the Internet and, guided by the topic ontology, gather topic-relevant information from multiple kinds of source; (2) a semi-automatic method for topic ontology evolution that, based on a collection of web pages and user behavior logs and combined with feature word extraction, evolves the topic ontology under user guidance, with experiments verifying the effectiveness of the scheme.
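The abstract does not give the thesis's actual algorithms, so the following is only a minimal sketch of the general idea, under loud assumptions: the topic ontology is flattened to a dictionary of weighted terms, relevance is plain cosine similarity over term frequencies, and candidate terms are ranked by document frequency. The names `relevance`, `candidate_terms`, `evolve`, and the `approve` callback are all hypothetical and stand in for whatever the thesis really uses.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the thesis).
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in", "for"}

def tokenize(text):
    """Lowercase ASCII word tokenizer; Chinese text would need a real segmenter."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def relevance(text, ontology):
    """Cosine similarity between page term frequencies and weighted ontology terms."""
    tf = Counter(tokenize(text))
    dot = sum(tf[term] * weight for term, weight in ontology.items())
    norm_page = math.sqrt(sum(c * c for c in tf.values()))
    norm_onto = math.sqrt(sum(w * w for w in ontology.values()))
    if norm_page == 0 or norm_onto == 0:
        return 0.0
    return dot / (norm_page * norm_onto)

def candidate_terms(relevant_pages, ontology, top_k=3):
    """Rank terms not yet in the ontology by how many relevant pages contain them."""
    df = Counter()
    for page in relevant_pages:
        df.update(set(tokenize(page)) - set(ontology))
    return [term for term, _ in df.most_common(top_k)]

def evolve(ontology, candidates, approve, initial_weight=0.5):
    """Return a copy of the ontology extended with the candidates the user approves."""
    updated = dict(ontology)
    for term in candidates:
        if approve(term):  # the "semi-automatic" step: a human confirms each term
            updated[term] = initial_weight
    return updated

if __name__ == "__main__":
    ontology = {"ontology": 1.0, "crawler": 0.8}
    pages = ["ontology evolution crawler", "evolution of rss feeds", "rss evolution"]
    print(relevance(pages[0], ontology))
    proposed = candidate_terms(pages, ontology, top_k=2)
    print(evolve(ontology, proposed, approve=lambda term: term == "evolution"))
```

In this sketch a collector would keep pages whose `relevance` exceeds some threshold, and the user-approved terms returned by `evolve` would steer later collection rounds. A real ontology also carries concept hierarchies and relations, which a flat dictionary ignores.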
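[Degree-granting institution]: Nanjing University of Aeronautics and Astronautics
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP391.1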

[References]