Research on Web News Content Extraction Based on Tag Path Features

Published: 2018-12-15 16:46

[Abstract]: Web news content extraction is an important step in intelligent Web information processing and underpins research and applications such as intelligence acquisition and security, online public opinion monitoring, personalized recommendation services for mobile terminals, heterogeneous Web data integration, information retrieval, and search engines. Studying the problems in this field therefore has significant research and practical value. Case analysis and further study show that many news websites share similar layout structures and styles, and that an implicit association exists between the content layout of a web page and the tag paths of its parse tree. Traditional path expressions are too rigid to adapt to small changes in HTML document structure during Web information extraction, which lowers extraction accuracy; in addition, Web news pages are massive in number and heterogeneous, which challenges the generality of both manually constructed wrappers and wrappers based on rule learning. This dissertation therefore studies Web news content extraction based on tag path features in two settings: for specific websites, high-precision extraction models and methods based on path pattern knowledge; for open environments, general-purpose extraction models and methods based on tag path features. The main contributions are as follows:
(1) Building on the implicit association between page content layout and the path patterns of the parse tree, a novel Web information extraction system model is proposed: PP-WNE, a Web news content extraction model based on distinguishing path patterns. On this basis, a special path pattern suited to Web news content extraction, the distinguishing path pattern, is defined, and a mining method for distinguishing path patterns is proposed, which solves the problem of building the extraction pattern knowledge base. On data sets of pages randomly selected from Chinese and English websites, experiments show that, with a suitably set noise-tolerance threshold, the extraction method based on path pattern mining reaches an F-measure above 98%, which also confirms the feasibility and effectiveness of applying path patterns to Web news content extraction.
(2) To optimize the size of the knowledge base in PP-WNE, the distinguishing path pattern cover problem is formulated and proved to be NP-complete. To obtain an approximately optimal solution, a special distinguishing path pattern, the minimal distinguishing path pattern, is defined, and on this basis a polynomial-time (ln|n|+1)-approximation algorithm, MPM, is designed, where n is the number of positive examples in the training sample. Experiments on test data sets show that MPM effectively reduces the set of distinguishing path patterns while achieving precision, recall, and F-measure above 98% under both node-level and text-level evaluation criteria.
(3) For Web news content extraction in open environments, a text-to-tag-path ratio feature is designed, its computation by traversing the nodes of the page parse tree is described, and CEPR, a threshold method that separates content from non-content based on the histogram of text-to-tag-path ratios, is proposed, which addresses online Web news content extraction. A weighted Gaussian smoothing method based on path edit distance is also proposed; it improves CEPR's ability to extract short text and solves the problem of filtering non-news content embedded in news content. CEPR is a fast, general, training-free web content extraction algorithm that can handle Web pages from many sources and in many styles and languages. Experiments on the CleanEval test data set show that CEPR outperforms CETR and other extraction methods in most cases.
(4) NFaS, a filtering and summarization system for HTML news pages, is designed and implemented. It includes an automatic Web news page identification method that combines URL features, page structure features, and content attribute features, which solves the automatic identification of Web news pages; it applies Web news content extraction to solve news page filtering; and it uses a keyword extraction method based on semantic relations between words, building a word semantic relation graph from lexical chains to extract high-quality keywords and complete the news summarization task. Evaluation on test data sets confirms the effectiveness of NFaS.
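The text-to-tag-path ratio in (3) can be illustrated with a small sketch. The abstract does not give the exact definition, so this example assumes the ratio of a tag path is the total number of text characters found under that path divided by the number of text nodes that share it. The class `PathRatio`, the function `text_to_tag_path_ratio`, and the sample page are hypothetical names used only for illustration; the histogram threshold and Gaussian smoothing steps of CEPR are not reproduced.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag
SKIP = {"script", "style"}                           # text here is never content


class PathRatio(HTMLParser):
    """Collects, per tag path, the character count and number of text nodes."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # currently open tags, root first
        self.chars = defaultdict(int)   # tag path -> total text characters
        self.nodes = defaultdict(int)   # tag path -> number of text nodes

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (SKIP & set(self.stack)):
            path = "/".join(self.stack)
            self.chars[path] += len(text)
            self.nodes[path] += 1


def text_to_tag_path_ratio(html):
    """Return {tag path: characters per text node} (assumed definition)."""
    parser = PathRatio()
    parser.feed(html)
    parser.close()
    return {p: parser.chars[p] / parser.nodes[p] for p in parser.chars}


sample = (
    "<html><body>"
    "<div><a>Home</a><a>News</a><a>Sports</a></div>"
    "<div><p>A long first paragraph of the story.</p>"
    "<p>A second paragraph with more text.</p></div>"
    "</body></html>"
)
ratios = text_to_tag_path_ratio(sample)
for path, ratio in sorted(ratios.items()):
    print(path, round(ratio, 2))
```

In this toy page the navigation links under `html/body/div/a` get a much lower ratio than the story paragraphs under `html/body/div/p`, which is the kind of separation a threshold on the ratio histogram would exploit.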
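The cover problem in (2) has the shape of classical set cover, for which the greedy heuristic is known to achieve a ln(n)+1 approximation ratio. The abstract does not describe MPM itself, so the following is only a generic greedy set-cover sketch, assuming each candidate pattern is represented by the set of positive examples it matches. The function `greedy_cover` and the pattern strings are hypothetical.

```python
def greedy_cover(universe, patterns):
    """Greedy set cover.

    universe: iterable of positive-example ids that must all be covered.
    patterns: dict mapping a pattern name to the set of ids it matches.
    Returns the chosen pattern names in selection order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the pattern covering the most still-uncovered examples;
        # sorting the names first makes tie-breaking deterministic.
        best = max(sorted(patterns), key=lambda p: len(patterns[p] & uncovered))
        gain = patterns[best] & uncovered
        if not gain:
            raise ValueError("some positive examples are matched by no pattern")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical candidate patterns and the positive examples each one matches.
candidates = {
    "html/body/div/p": {1, 2, 3, 4},
    "html/body/div/div/p": {3, 4, 5},
    "html/body/article/p": {5, 6},
    "html/body/div/span": {1, 6},
}
selected = greedy_cover({1, 2, 3, 4, 5, 6}, candidates)
print(selected)
```

Here two of the four candidate patterns are enough to cover all six positive examples, which is the sense in which a cover algorithm shrinks the pattern knowledge base.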
[Degree-granting institution]: Hefei University of Technology
[Degree level]: Doctoral
[Year degree awarded]: 2012
[Classification number]: TP391.1;TP393.092

[References]