Research on Deduplication Methods for Near-Mirror Web Pages

Published: 2018-05-30 21:27

Topic: near-mirror web pages + Simhash; Source: Donghua University, 2017 master's thesis

[Abstract]: With the rapid development of information technology, web page data on the Internet is growing explosively, and the large number of near-mirror (near-duplicate) web pages has become the biggest obstacle to obtaining useful information quickly. To deal with the many duplicate pages encountered in web search, researchers have proposed a variety of near-mirror page deduplication algorithms. These achieve good results in ordinary information retrieval, but their resistance to web page noise is unsatisfactory, and on highly time-sensitive news pages they often misjudge, so their stability is low. To address these problems, this thesis tries two Simhash-based web page deduplication algorithms. The first is a Simhash-based long-sentence-extraction algorithm, which targets noise sensitivity. Commonly used deduplication algorithms all include a feature extraction step, and the noise words picked up in that step lower both precision and recall. An analysis of web page noise shows that noise text is generally short, so restricting feature-word segmentation to the long sentences extracted from the page text effectively avoids the noise in the page and reduces its adverse effect on the algorithm. The second is a Simhash-based special-weight-ratio algorithm, which targets the frequent misjudgments that occur when deduplicating highly time-sensitive news pages. Simhash weights feature words by simple term-frequency statistics. News pages of the same category often have similar text and differ only in time and place, so the feature words that Simhash extracts and their weights are also similar, which leads to misjudgment. The special weighting takes core vocabulary into account: core words in a news page receive an extra weight ratio, which strengthens their influence on the text fingerprint, so that two pages whose core words differ substantially can be told apart. Finally, in line with practical needs, the two algorithms were applied to the web page deduplication module of a free trade zone enterprise dynamic information system, and practice showed the algorithms to be sound and effective.
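The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no parameters, so the sentence-length threshold `min_len`, the boost factor `core_boost`, the MD5-based token hash, the regex tokenizer, and all function names are assumptions chosen for illustration. Real Chinese text would need a proper word segmenter in place of `tokenize`.

```python
import hashlib
import re
from collections import Counter

FINGERPRINT_BITS = 64  # assumed fingerprint width


def long_sentences(text, min_len=20):
    """Keep only sentences of at least min_len characters (assumed threshold).

    Short fragments such as menu labels are treated as noise and dropped.
    """
    parts = re.split(r"[.!?。！？\n]+", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_len]


def tokenize(sentence):
    """Placeholder tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", sentence.lower())


def token_hash(token):
    """64-bit hash of a token, taken from the first 8 bytes of its MD5 digest."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weights):
    """Standard Simhash: per-bit weighted vote over all token hashes."""
    totals = [0.0] * FINGERPRINT_BITS
    for token, weight in weights.items():
        h = token_hash(token)
        for bit in range(FINGERPRINT_BITS):
            totals[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def page_fingerprint(text, core_words=frozenset(), core_boost=3.0, min_len=20):
    """Fingerprint a page from its long sentences, boosting core words.

    Weights start as term frequencies; tokens listed in core_words are
    multiplied by core_boost (a hypothetical stand-in for the extra
    weight ratio described in the abstract).
    """
    tokens = [t for s in long_sentences(text, min_len) for t in tokenize(s)]
    weights = {
        t: count * (core_boost if t in core_words else 1.0)
        for t, count in Counter(tokens).items()
    }
    return simhash(weights)


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two pages would then be compared with `hamming_distance(page_fingerprint(a), page_fingerprint(b))` and treated as near-mirrors when the distance falls below some chosen threshold; the threshold itself is not stated in the abstract and would have to be tuned.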
[Degree-granting institution]: Donghua University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP393.092
