Research on Duplicate Web Page Detection Algorithms in Search Engines
Topics: search engine + duplicate web page detection; Source: Master's thesis, Henan University of Technology, 2012
【Abstract】: With the spread and rapid development of the Internet, online information is growing at an exponential rate, and search engines have become an effective tool for users to locate the information they need among massive network resources. However, because there is no clear, uniform standard for publishing information on the Internet, and publishing is easy, the web contains a large number of pages with duplicate or near-duplicate content. These duplicate pages cause many problems for search engines: they degrade the user experience, waste crawling and storage resources, inflate the inverted index, and reduce retrieval efficiency. Duplicate web page detection can therefore effectively improve search engine quality.

In recent years, major search engine companies and researchers at home and abroad have proposed a variety of duplicate page detection algorithms, such as signature-based algorithms, the I-Match algorithm, feature-term-based detection, and the DSC algorithm. A detailed analysis of the existing algorithms shows that they share a common idea: first extract certain information from the text, then use the extracted information to judge similarity. The algorithms differ in their strategies for extracting text information, which leads to different similarity computations, and some compress the extracted information to improve efficiency. Whether effective information can be extracted from the text content to represent it accurately is thus the key factor determining the performance of duplicate page detection.

This thesis analyzes two classic duplicate page detection algorithms in detail and addresses their shortcomings. The main contributions are as follows:

(1) An improved DSC duplicate page detection algorithm

DSC (Digital Syntactic Clustering) is a classic duplicate detection algorithm. Its basic idea is to split the text into a number of shingles and then select a subset of them for similarity comparison. Its drawback is that the shingles are selected at random, without fully exploiting the content features of the text. To address this, the improved algorithm maintains a set of feature terms and selects only shingles that contain a feature term, so that the shingles taking part in the similarity comparison better reflect the structural and content features of the text (see the sketch after this abstract).

(2) An improved feature-term-based duplicate page detection algorithm

The feature-term-based algorithm first uses the TFIDF weighting scheme from traditional information retrieval to extract the feature terms of a text, represents the text as a vector of feature terms, and then judges similarity with the cosine formula. The drawback of TFIDF is that it ignores a feature term's position in the text when computing its weight. Observation of web pages shows that page text is typically short, usually carries a title, and the title is a highly condensed summary of the content. Exploiting this, the improved algorithm boosts the weight of feature terms that appear in the title (a sketch of the boosted weighting also follows).

(3) Performance evaluation of the improved algorithms

A search engine prototype based on Lucene, the open-source indexing and retrieval toolkit, was implemented to validate the improved algorithms. Experimental results show that both improved algorithms achieve higher recall and precision in duplicate page recognition than the original algorithms.
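To make the shingle-selection idea in (1) concrete, here is a minimal Python sketch. The window size k=4, the example feature-term set, and the function names (`shingle`, `select_feature_shingles`, `resemblance`) are illustrative assumptions; the abstract does not specify these implementation details.

```python
# Minimal sketch of DSC-style shingling with feature-term-guided selection.
# Window size, feature terms, and function names are assumptions, not the
# thesis's actual implementation.

def shingle(tokens, k=4):
    """Split a token sequence into overlapping k-token shingles."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def select_feature_shingles(shingles, feature_terms):
    """Keep only shingles containing at least one feature term,
    instead of sampling shingles at random as classic DSC does."""
    return {s for s in shingles if any(t in feature_terms for t in s)}

def resemblance(a, b):
    """Jaccard resemblance between two shingle sets (DSC's similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "search engines detect duplicate web pages by comparing shingles".split()
doc2 = "search engines find duplicate web pages by comparing selected shingles".split()
features = {"duplicate", "shingles", "search"}  # hypothetical feature-term set

s1 = select_feature_shingles(shingle(doc1), features)
s2 = select_feature_shingles(shingle(doc2), features)
print(f"resemblance = {resemblance(s1, s2):.2f}")
```

Because every retained shingle contains a feature term, the comparison concentrates on content-bearing regions of the two texts rather than on an arbitrary random sample.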
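The title-boosted weighting in (2) can be sketched as follows. The boost factor `title_boost=2.0`, the smoothed IDF formula, and the function names are assumptions; the abstract states only that the weights of feature terms appearing in the title are increased before the cosine comparison.

```python
# Sketch of title-boosted TFIDF weighting plus cosine similarity.
# Boost factor and IDF smoothing are illustrative assumptions.
import math
from collections import Counter

def tfidf_vector(body_tokens, title_tokens, df, n_docs, title_boost=2.0):
    """TFIDF weights for one document; terms that also occur in the
    title get their weight multiplied by title_boost."""
    tf = Counter(body_tokens)
    title = set(title_tokens)
    vec = {}
    for term, freq in tf.items():
        idf = math.log((n_docs + 1) / (df.get(term, 0) + 1)) + 1  # smoothed IDF
        weight = freq * idf
        if term in title:
            weight *= title_boost  # position-aware adjustment for title terms
        vec[term] = weight
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

df = {"duplicate": 40, "pages": 55, "detection": 25}  # toy document frequencies
v1 = tfidf_vector("duplicate pages detection duplicate".split(),
                  "duplicate detection".split(), df, n_docs=100)
v2 = tfidf_vector("detection of duplicate pages".split(),
                  "duplicate pages".split(), df, n_docs=100)
print(f"cosine similarity = {cosine(v1, v2):.2f}")
```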
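For the evaluation in (3), recall and precision over duplicate pairs can be computed as below. This is a generic evaluation sketch over a hypothetical labeled pair set; it does not reproduce the thesis's Lucene-based prototype.

```python
# Sketch of the recall/precision computation for duplicate detection:
# compare detected duplicate pairs against a labeled gold-standard set.
# The pair data is hypothetical.

def evaluate(detected_pairs, gold_pairs):
    """Return (precision, recall) for a set of detected duplicate pairs."""
    detected = {frozenset(p) for p in detected_pairs}  # order-insensitive pairs
    gold = {frozenset(p) for p in gold_pairs}
    tp = len(detected & gold)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

detected = [("page1", "page3"), ("page2", "page5"), ("page4", "page6")]
gold = [("page1", "page3"), ("page2", "page5"), ("page7", "page8")]
p, r = evaluate(detected, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```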
【Degree-granting institution】: Henan University of Technology
【Degree level】: Master's
【Year conferred】: 2012
【CLC number】: TP391.3