Research and Implementation of the Parallelization of a Similar Web Page Deduplication Algorithm

Published: 2019-03-21 10:26

[Abstract]: Because of Web mirroring and online reposting and plagiarism, exactly duplicated and near-duplicate web pages cause a series of problems for current search engines: they increase the storage required for the page index, place a heavy load on the retrieval service, and fill search results with repeated content, which degrades the user experience. Search engines therefore need efficient similar-page detection algorithms to find and remove near-duplicate pages and so reduce the load on web crawlers and on the search engine itself. Search engine technology has developed rapidly in recent years, and similar-page deduplication is essential to Web data collection. A similar-page deduplication system must identify the main content block of a page while discarding noise such as advertisements. The page text is segmented into words using a dictionary, a feature vector is extracted from it with the Shingle algorithm, and the Simhash algorithm computes from that feature vector a fingerprint that represents the page. The fingerprint has the property that if two pages have similar content, their fingerprints are separated by a small Hamming distance. In addition, a parallelized improvement of the traditional Shingle algorithm based on the MapReduce model is proposed and verified experimentally. A practical architecture for a Web page collection system is presented, together with the design and implementation of a prototype similar-page deduplication system within it.
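The fingerprinting pipeline described above (shingling, then Simhash, then Hamming distance) can be illustrated with a minimal Python sketch. It is not the thesis implementation: whitespace tokenization stands in for dictionary-based word segmentation, every shingle is given equal weight, and the shingle size `k=3`, the 64-bit fingerprint width, and the MD5-based helper `hash64` are all assumptions chosen for illustration.

```python
import hashlib


def shingles(tokens, k=3):
    """Return the set of contiguous k-token shingles (k=3 is an assumed default)."""
    if not tokens:
        return set()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def hash64(text):
    """Hypothetical helper: a 64-bit hash taken from the first 8 bytes of MD5."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(features, bits=64):
    """Compute an unweighted Simhash fingerprint over a collection of features."""
    counts = [0] * bits
    for feature in features:
        h = hash64(feature)
        for i in range(bits):
            # Each feature votes +1 for a set bit and -1 for a clear bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Whitespace splitting is a stand-in for dictionary-based segmentation.
    page_a = "search engines need to remove near duplicate web pages".split()
    page_b = "search engines need to remove near duplicate pages".split()
    fp_a = simhash(shingles(page_a))
    fp_b = simhash(shingles(page_b))
    print(hamming(fp_a, fp_b))
```

Two pages would then be treated as near-duplicates when the Hamming distance between their fingerprints falls at or below a chosen threshold; the threshold value is a tuning decision that this sketch does not fix.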
The prototype similar-page deduplication system has two working modes. In the online mode, a single page fingerprint is compared against the massive set of historical fingerprints. In the parallel processing mode, a batch of page fingerprints is compared against the massive set of historical fingerprints; unlike the online mode, the historical fingerprints are split into data blocks stored on a distributed computing platform, and the Hamming distances between the two sets of fingerprints are computed with the MapReduce parallel programming model. Experiments show that the prototype using parallel processing effectively solves the similar-page deduplication problem and achieves high efficiency and accuracy.
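The parallel processing mode can be sketched in the same spirit. The following single-process Python example only imitates the map and reduce phases; it is not the thesis code and does not use a real distributed framework. The function names `map_block`, `reduce_matches`, and `find_near_duplicates`, the block layout, and the default threshold of 3 bits are assumptions made for illustration.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def map_block(block, batch, threshold):
    """Map phase: compare one block of historical fingerprints with the batch.

    Emits (new_fingerprint, historical_fingerprint) pairs whose Hamming
    distance is within the threshold.
    """
    for old in block:
        for new in batch:
            if hamming(new, old) <= threshold:
                yield new, old


def reduce_matches(pairs):
    """Reduce phase: group matching historical fingerprints by new fingerprint."""
    grouped = defaultdict(list)
    for new, old in pairs:
        grouped[new].append(old)
    return dict(grouped)


def find_near_duplicates(history_blocks, batch, threshold=3):
    """Run the map phase over every block, then reduce the emitted pairs.

    On a distributed platform each block would be handled by a separate
    mapper; here the blocks are processed sequentially.
    """
    pairs = []
    for block in history_blocks:
        pairs.extend(map_block(block, batch, threshold))
    return reduce_matches(pairs)


if __name__ == "__main__":
    history_blocks = [[0b0000, 0b1111], [0b0001]]  # toy 4-bit fingerprints
    batch = [0b0000, 0b1110]
    print(find_near_duplicates(history_blocks, batch, threshold=1))
```

Because each block is compared with the batch independently of the others, the map phase has no shared state, which is what makes the comparison straightforward to distribute across workers.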
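[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2009
[Classification number]: TP393.092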
