An Improved Web Page Deduplication Algorithm Based on Character-Set Feature Vectors
Published: 2019-06-29 07:45
[Abstract]: Web page deduplication algorithms that compute a digital fingerprint with MD5 are simple and efficient, and are widely used in the field. However, because MD5 is a strict cryptographic hash algorithm, even a very small change to an article's content yields a completely different fingerprint, so deduplication based on it has a relatively low recall rate. This paper proposes two improved web page deduplication algorithms based on character-set feature vectors: the content of an article is mapped into a character-set space, and the distance between articles in that space is computed to decide whether they are similar. The proposed algorithms generalize well: reordering text within a paragraph, or adding, deleting, or changing a few individual characters, does not prevent similar paragraphs from being recognized, which greatly improves the recall rate of web page deduplication. Experimental results show that the algorithms have O(n) time complexity and O(1) space complexity, making them suitable for large-scale web page deduplication.
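The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation, since the abstract does not specify either of the two proposed algorithms. The function names, the fixed vector size `DIM`, the bucketing of characters by code point, the use of cosine distance, and the `threshold` value are all assumptions made for illustration. Each text is read in a single pass (O(n) time) into a fixed-size count vector (O(1) space with respect to text length).

```python
import hashlib
import math
from typing import List

# Hypothetical fixed vector size; the abstract gives no concrete value.
DIM = 1024


def md5_fingerprint(text: str) -> str:
    """Exact-match fingerprint: any change to the text changes the digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def charset_vector(text: str, dim: int = DIM) -> List[int]:
    """Count characters into a fixed-size vector (assumed bucketing scheme).

    One pass over the text; the vector size does not depend on text length.
    Character order is discarded, so reordering leaves the vector unchanged.
    """
    vec = [0] * dim
    for ch in text:
        if ch.isspace():
            continue  # ignore whitespace (an assumption of this sketch)
        vec[ord(ch) % dim] += 1
    return vec


def cosine_distance(a: List[int], b: List[int]) -> float:
    """1 - cosine similarity; used here as one possible distance measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def is_near_duplicate(a: str, b: str, threshold: float = 0.1) -> bool:
    """Hypothetical decision rule: small distance means near-duplicate."""
    return cosine_distance(charset_vector(a), charset_vector(b)) <= threshold


original = "the quick brown fox jumps over the lazy dog"
reordered = "over the lazy dog the quick brown fox jumps"
edited = "the quick brown fox jumped over the lazy dog"

# MD5 treats all three texts as different documents.
print(md5_fingerprint(original) == md5_fingerprint(reordered))
print(md5_fingerprint(original) == md5_fingerprint(edited))

# The character-count vectors remain close under reordering and small edits.
print(is_near_duplicate(original, reordered))
print(is_near_duplicate(original, edited))
```

In this sketch the threshold trades recall against precision: a larger value catches more edited copies but also risks merging unrelated pages, so it would need to be tuned on real data.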
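[Author affiliation]: Department of Computer Science, China University of Petroleum (Beijing)
[Funding]: National "Tenth Five-Year Plan" Key Science and Technology Project (No.2001BA605A09)
[Classification number]: TP393.092
Article No.: 2507666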
Article link: https://www.wllwen.com/guanlilunwen/ydhl/2507666.html