Design and Analysis of a Topic Web Crawler Based on a Dynamic Concept Graph

Published: 2018-06-15 09:00

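Topics of this thesis: topic web crawler + web page segmentation; source: University of Science and Technology Liaoning, 2013 master's thesis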

[Abstract]: With the arrival of the network information age, the amount of information on the web has grown exponentially, making the efficient extraction of useful information from web pages an important topic in web information retrieval. General-purpose search engines struggle to cope with the Internet's massive volume of data and its explosive growth, while users demand more comprehensive and more frequently updated data. Their needs concern not merely a single keyword but a whole topic or domain, and this has led to the emergence of topic web crawlers. A topic web crawler is the foundation and a key component of a topic search engine. Its design goal is to collect as many pages relevant to a specific topic as possible while discarding as many irrelevant pages as possible, thereby using network bandwidth effectively, saving storage space, and improving crawling efficiency and topic coverage. Starting from the characteristics of topic web crawlers, this thesis studies them in detail. The main contributions are as follows:
1. Based on two basic features of web pages, an algorithm is proposed that segments a page into blocks directly from detected separators. The concept of relative-position layout is used to represent the relative positions of blocks when the heights of some blocks are unknown. Whether a node can be split further is decided jointly by a limit on the total number of blocks and by the node's character length, width, and height. Uniformity is exploited first during segmentation, which greatly improves segmentation efficiency; blocks are obtained directly by detecting separators, and a node feature sequence tree avoids repeatedly extracting the same information from a node. The algorithm is top-down and highly efficient.
2. Three observational principles about websites are put forward, and several conclusions are drawn from them. Content pages are identified by combining page segmentation with the uniformity of page style. The concept of an algorithm server is proposed on the basis of website stability. Based on the similarity in how the same topic is classified and categorized, a topic web crawler based on a dynamic weighted directed concept graph is proposed, and the framework of the concept graph is given.
3. Topic relevance is computed as a weighted combination of several factors. The concept of a layer is introduced to express how far a keyword is from the topic, keywords are divided more finely when layer weights are computed, and predicted nodes based on the concept graph are included in topic relevance prediction.
4. The node structure of the concept graph is given, and a dynamic update method for the concept graph is derived from it. To keep the topic extensible while avoiding topic drift, the concept of dedicated terms is proposed, and corresponding extension methods are given for two different ways of expanding the topic.
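The abstract does not give the thesis's formulas or node layout, so the following Python sketch is only one possible reading of the layered, weighted concept graph it describes. Every name here (`ConceptNode`, `ConceptGraph`, `page_score`), the exponential layer decay, and the body/anchor weights are illustrative assumptions, not the author's definitions.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptNode:
    """One keyword in the concept graph (hypothetical node structure)."""
    term: str
    layer: int  # 0 = the topic itself; larger values are farther from the topic
    edges: dict = field(default_factory=dict)  # neighbour term -> edge weight


class ConceptGraph:
    """Weighted directed graph of keywords, organised in layers around a topic."""

    def __init__(self, topic, layer_decay=0.5):
        # layer_decay is an assumed parameter: weight = layer_decay ** layer
        self.layer_decay = layer_decay
        self.nodes = {topic: ConceptNode(topic, 0)}

    def add_term(self, parent, term, weight):
        """Dynamic update: attach a new term one layer below its parent,
        or strengthen the edge if the term is already known."""
        parent_node = self.nodes[parent]
        if term not in self.nodes:
            self.nodes[term] = ConceptNode(term, parent_node.layer + 1)
        parent_node.edges[term] = parent_node.edges.get(term, 0.0) + weight

    def layer_weight(self, term):
        """Terms closer to the topic count more; unknown terms count zero."""
        node = self.nodes.get(term)
        return 0.0 if node is None else self.layer_decay ** node.layer

    def relevance(self, tokens):
        """Average layer weight over a token list (0.0 for an empty list)."""
        if not tokens:
            return 0.0
        return sum(self.layer_weight(t) for t in tokens) / len(tokens)


def page_score(graph, body_tokens, anchor_tokens, w_body=0.7, w_anchor=0.3):
    """Weighted combination of two relevance factors (weights are assumed)."""
    return (w_body * graph.relevance(body_tokens)
            + w_anchor * graph.relevance(anchor_tokens))
```

In a crawler, a score like `page_score` could be used to order the URL frontier, and `add_term` could be called as new co-occurring keywords are observed. Restricting which terms may be added is where a mechanism such as the dedicated terms mentioned above would guard against topic drift.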
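[Degree-granting institution]: University of Science and Technology Liaoning
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3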

[References]