Research on Near-Duplicate Web Page Deduplication Algorithms Based on the Page Body-Text Structure Tree

Published: 2018-03-05 10:09

Topic: web page deduplication　Focus: body-text structure tree　Source: Chongqing University, master's thesis, 2013　Document type: degree thesis

[Abstract]: According to the Association for Computing Machinery (ACM), duplicate pages account for roughly 30%-45% of all web pages. As the number of search engines grows and users expect a better search experience, search quality has become a key factor in how search engines win users. If a search engine can remove duplicate pages promptly, it not only saves a large amount of storage space, and so indirectly lowers equipment procurement costs, but also improves retrieval quality and access efficiency and raises user satisfaction. Extracting features from the body text of a page and comparing similarity at large scale are the key problems in web page deduplication.

Traditional algorithms fall into three categories according to their most prominent characteristics: URL-based deduplication, which can only remove exact duplicates identified by their URL; deduplication based on feature-string matching, which achieves high accuracy but is time-consuming; and clustering-based deduplication, which achieves high recall but low accuracy on news items and template-based articles. An analysis of reprinted pages shows that although the content of a duplicate page may change, the document format rarely does, so the structure of the body text remains almost unchanged. Building on this observation, this thesis proposes two deduplication algorithms based on the body-text structure tree.

An analysis of duplicate pages also shows that long sentences are not representative of a page's topic and that, faced with the modification rules of web page collectors, longer sentences prove more fragile. The thesis therefore improves the deduplication algorithm based on body-text structure and long sentences, and proposes an algorithm based on the body-text structure tree and key sentences. This algorithm extracts sentences that contain keywords as feature sentences, and the number of feature sentences is determined by paragraph length, so that the extracted feature sentences summarize the article more completely. Experiments show that the improved algorithm raises both accuracy and recall.

The finer the granularity of a feature item, the less susceptible the hashed feature fingerprint is to interference. Based on this property, the thesis proposes a deduplication algorithm based on the body-text structure tree and feature strings. First, the algorithm extracts, as feature codes, the first and last Chinese characters of the sentences in which the page's high-frequency punctuation marks occur. Second, it uses a Bloom filter to obtain feature fingerprints. Finally, it judges similarity by comparing fingerprints level by level. Experiments show that this algorithm substantially improves recall, most noticeably on small documents, and greatly reduces deduplication time.
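
The three steps of the feature-string algorithm (feature codes from sentences with high-frequency punctuation, a Bloom filter fingerprint, a similarity judgment) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the punctuation set, the Bloom filter size `m`, the number of hash functions `k`, and the use of a Jaccard ratio over set bits are all assumptions made for illustration. The sketch also works on flat text and omits the structure tree and the level-by-level comparison.

```python
import hashlib
import re
from collections import Counter

PUNCT = "，。！？；：,.!?;:"          # assumed set of sentence delimiters
CJK = re.compile(r"[\u4e00-\u9fff]")  # common Chinese characters

def feature_codes(text, top_n=1):
    """First + last Chinese character of each segment that ends in one of
    the top_n most frequent punctuation marks of the text."""
    counts = Counter(ch for ch in text if ch in PUNCT)
    frequent = {p for p, _ in counts.most_common(top_n)}
    codes, start = [], 0
    for i, ch in enumerate(text):
        if ch in PUNCT:
            if ch in frequent:
                hanzi = CJK.findall(text[start:i])
                if hanzi:
                    codes.append(hanzi[0] + hanzi[-1])
            start = i + 1
    return codes

def bloom_fingerprint(codes, m=256, k=3):
    """Set k bits per feature code in an m-bit Bloom filter (stored as int)."""
    bits = 0
    for code in codes:
        for seed in range(k):
            digest = hashlib.md5(f"{seed}:{code}".encode("utf-8")).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % m)
    return bits

def similarity(fp_a, fp_b):
    """Jaccard ratio over set bits; 0.0 when both fingerprints are empty."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0
    return bin(fp_a & fp_b).count("1") / union

original = "今天天气很好，我们去公园散步，然后回家吃饭。"
reprint = original + "转载请注明出处。"   # reprint with an appended notice
print(feature_codes(original))            # ['今好', '我步']
print(similarity(bloom_fingerprint(feature_codes(original)),
                 bloom_fingerprint(feature_codes(reprint))))  # 1.0
```

In this toy example the appended reprint notice ends in a punctuation mark that is not the most frequent one, so it contributes no feature code and the two fingerprints are identical. A real system would compare the similarity against a tuned threshold rather than expect an exact match.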
[Degree-granting institution]: Chongqing University
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP393.092
