Research on Automatic Summarization of Web Text

Published: 2018-04-01 22:21

Topic: Web main-text extraction  Focus: topic analysis  Source: Dalian University of Technology, master's thesis, 2012

[Abstract]: With the rapid development of Internet technology, web pages have become the most important information resource, but this has brought the problem of "information explosion". Besides the main text that describes a page's topic, web pages often contain junk information such as navigation bars, advertising links, and copyright notices, so finding the information users need quickly and accurately across the vast web is an urgent problem. A summary condenses and distills a text; by reading it, readers can decide whether the full text is worth reading, which saves valuable time and effort.
Web automatic summarization rests on the extraction of Web main-text information, which is also the basis of other Web information processing tasks such as information retrieval and text mining. After reviewing and analyzing existing methods, this thesis statistically analyzes the main-text features and page-structure features of topic pages and proposes a main-text extraction method for topic pages that combines HTML tags with main-text features. The Web page is first parsed into a DOM tree; the position of the main-text block in the DOM tree is located from the main-text features; the characteristics of noise inside that block are then analyzed and the in-block noise is removed. The method needs no prior training on samples, adapts to some degree to different pages, handles noise explicitly, and achieves high extraction accuracy.
On this basis, the thesis combines understanding-based summary generation with structure-based automatic summarization to address the poor completeness of topic-sentence extraction. The text is first segmented into sub-topics, and a sentence relation graph is built for each sub-topic. The graph-based PageRank algorithm ranks the sentences within each relation graph separately, and the topic sentences of each sub-topic are selected according to a set of extraction rules. This ensures that the extracted sentences are those with the widest semantic coverage of each topic in the text.
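The per-sub-topic ranking step described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not say how edges in the sentence relation graph are weighted or what the extraction rules are, so the bag-of-words cosine similarity, the damping factor of 0.85, the fixed iteration count, and the "top `per_topic` sentences" rule are all assumptions, and the names `rank_sentences` and `topic_sentences` are hypothetical. The tokenizer relies on word boundaries matched by `\w+`, so Chinese text would need a word segmenter first.

```python
import math
import re
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors (assumed edge weight)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score the sentences of one sub-topic with weighted PageRank."""
    n = len(sentences)
    if n == 0:
        return []
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    # weights[i][j] is the edge weight between sentences i and j; no self-loops.
    weights = [
        [cosine_similarity(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Sentences with no edges (dangling nodes) spread their score uniformly.
        dangling = sum(scores[j] for j in range(n) if out_sum[j] == 0.0)
        new_scores = []
        for i in range(n):
            incoming = sum(
                scores[j] * weights[j][i] / out_sum[j]
                for j in range(n)
                if out_sum[j] > 0.0
            )
            new_scores.append(
                (1.0 - damping) / n + damping * (incoming + dangling / n)
            )
        scores = new_scores
    return scores


def topic_sentences(subtopics, per_topic=1):
    """Pick the top-ranked sentences of each sub-topic, kept in original order."""
    summary = []
    for sentences in subtopics:
        scores = rank_sentences(sentences)
        top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
        summary.extend(sentences[i] for i in sorted(top[:per_topic]))
    return summary
```

Here `subtopics` is a list of sentence lists, one per segment produced by the topic-segmentation step. Ranking each segment separately, rather than the whole document at once, is what keeps every sub-topic represented in the summary.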
Finally, a Web summary extraction system is designed and implemented. Experiments are run on real corpora collected from the Web, and the results are compared with those of existing similar methods. The proposed Web main-text extraction method is evaluated first: 500 web pages from five different websites are selected, and the results are assessed by precision and recall. The summary extraction method is then evaluated. The experiments show that the proposed algorithm achieves high extraction accuracy and good topic coverage.
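For reference, precision and recall for an extraction task can be computed along the following lines. The abstract does not state the unit of comparison, so this sketch assumes token-level overlap between the extracted text and a hand-labelled reference text; the function name `precision_recall` and that choice of unit are assumptions, not details from the thesis.

```python
from collections import Counter


def precision_recall(extracted_tokens, reference_tokens):
    """Token-level precision and recall of extracted text against a reference.

    precision = overlapping tokens / extracted tokens
    recall    = overlapping tokens / reference tokens
    """
    extracted = Counter(extracted_tokens)
    reference = Counter(reference_tokens)
    # Multiset intersection: a token counts as many times as it occurs in both.
    overlap = sum((extracted & reference).values())
    precision = overlap / sum(extracted.values()) if extracted else 0.0
    recall = overlap / sum(reference.values()) if reference else 0.0
    return precision, recall
```

Precision drops when leftover noise such as navigation or advertising text is extracted along with the main text, while recall drops when part of the main text is missed, which is why the two are reported together.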
[Degree-granting institution]: Dalian University of Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.1

[References]