Research on Page Extraction Methods Based on Group Features

Published: 2018-03-08 17:01

Topic: page extraction　Entry point: page clustering　Source: China University of Geosciences (Beijing), 2017 master's thesis　Document type: degree thesis

[Abstract]: With the continuous development of the Internet, the Web has become the world's largest information carrier. Big data technology has given us the ability to acquire massive amounts of data, and the arrival of the Web 2.0 era has made information distribution an important everyday channel for obtaining information. Extracting the useful information from the Internet's vast number of pages is therefore of great significance for acquiring and using information. Web pages are usually written in HTML, a semi-structured markup language. A typical Web page is generated by reading data from a database and rendering a template page to produce the final HTML code. Based on a study of this generation process, this thesis proposes a method that fuses sample pages using the DOM (Document Object Model) tree model, computes node variation statistics over the fused result to locate the main-content block node, and automatically induces extraction rules. On this basis, a sample-page clustering workflow is designed that gathers pages sharing the same template from a massive page collection. The thesis also focuses on the problem of extraction rules being invalidated by website redesigns: by improving the sample-page clustering workflow, extraction rules adapt automatically to changes in page structure, making extraction genuinely automatic. In addition, extraction rules and link generalization results are used to cluster pages further, which yields finer-grained sample grouping and adaptation to structural change. Building on the proposed rule-induction algorithm and sample-page collection framework, a complete extraction system is designed and implemented. Following the algorithm framework, the system comprises four modules: sample collection, template extraction, page extraction, and control and scheduling. The first three modules run independently and are easy to deploy in a distributed fashion; the control and scheduling module governs their workflow and the direction of data flow. The modules interact through network communication, which ensures high availability and also meets the need for high throughput. Operation in a real production environment shows that the system runs well at a scale on the order of ten million extractions per day. When extracting news pages, both the recall and the precision of the results reach a very high level.
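The abstract does not give the details of the fusion and variation-degree computation, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes well-formed sample pages from one template, identifies nodes by their tag path with sibling indices, and defines a node's variation degree as the number of distinct texts seen at that path divided by the number of pages. The names PathTextCollector, fuse_pages, variation_degree, and find_content_path are hypothetical and do not come from the thesis, and the tie-break by average text length is likewise an assumption.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements have no closing tag, so they never enter the path stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextCollector(HTMLParser):
    """Collect the direct text of each node, keyed by a path like 'html[0]/body[0]/div[1]'.

    Assumes well-formed HTML (every non-void start tag has a matching end tag).
    """

    def __init__(self):
        super().__init__()
        self.stack = []                          # path components of the open nodes
        self.child_counts = [defaultdict(int)]   # sibling counters, one per depth
        self.texts = defaultdict(str)            # path -> concatenated direct text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.child_counts[-1]
        self.stack.append(f"{tag}[{counts[tag]}]")
        counts[tag] += 1
        self.child_counts.append(defaultdict(int))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        self.stack.pop()
        self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.texts["/".join(self.stack)] += text


def fuse_pages(pages):
    """Merge all sample pages into one mapping: path -> list of texts, one per page."""
    fused = defaultdict(list)
    for html in pages:
        collector = PathTextCollector()
        collector.feed(html)
        collector.close()
        for path, text in collector.texts.items():
            fused[path].append(text)
    return fused


def variation_degree(texts, page_count):
    """Assumed definition: distinct texts at a path divided by the number of pages."""
    return len(set(texts)) / page_count


def find_content_path(pages):
    """Return the path that varies most across pages; ties go to the longer average text."""
    fused = fuse_pages(pages)

    def score(path):
        texts = fused[path]
        average_length = sum(len(t) for t in texts) / len(texts)
        return (variation_degree(texts, len(pages)), average_length)

    return max(fused, key=score)


# Three toy pages rendered from the same hypothetical template.
PAGES = [
    "<html><body><div>Site News</div><div><h1>Title A</h1>"
    "<p>First story body text.</p></div><div>Copyright</div></body></html>",
    "<html><body><div>Site News</div><div><h1>Title B</h1>"
    "<p>Second story body text, longer.</p></div><div>Copyright</div></body></html>",
    "<html><body><div>Site News</div><div><h1>Title C</h1>"
    "<p>Third story body text here.</p></div><div>Copyright</div></body></html>",
]
```

In this toy data the header and footer nodes have identical text on every page, so their variation degree is low, while the title and paragraph nodes differ on every page; calling find_content_path(PAGES) selects the paragraph path html[0]/body[0]/div[1]/p[0]. A path of this kind could then serve as a simple extraction rule for other pages from the same template.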
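[Degree-granting institution]: China University of Geosciences (Beijing)
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1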
