Research and Implementation of Web Page Deduplication Technology

Published: 2018-02-27 01:05

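Keywords: web page deduplication; character frequency; segmentation; edit distance; feature string　Source: University of Electronic Science and Technology of China (电子科技大学), master's thesis, 2012　Document type: degree thesis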

[Abstract]: With the development and wide adoption of the Internet, online information has grown explosively, and the Internet has become an important source of information. Search engines emerged to help people find what they need quickly; they save time and have become one of the most frequently used network services. However, statistical reports from the China Internet Network Information Center (CNNIC) show that too many duplicate results is the main problem users encounter when using search engines. According to statistics, roughly 30% of the pages on the Internet are duplicates, most of them caused by reposting. Duplicate pages waste storage space and increase search engine processing time. They also fill search results with pages of identical content, which lowers retrieval quality. Web page deduplication has therefore become an indispensable task in search engines. This thesis reviews the origin and current state of web page deduplication and carries out the following work:
(1) High-quality deduplication operates on the body text of a page. The thesis first studies the internal structure of web pages and proposes a DOM-based body text extraction algorithm that partitions a page into blocks, aggregates them, and filters them to obtain the body text, which then serves as the object of deduplication. Experiments show that the algorithm has high accuracy.
(2) An online web page deduplication system is designed that implements two deduplication algorithms: summary-based deduplication and full-text deduplication. The system improves retrieval quality by deduplicating search engine results.
(3) Two web page deduplication algorithms are proposed: one based on character-frequency features and one based on segment features.
(4) The character-frequency algorithm extracts the character frequencies of the body text as the page's primary feature string and uses additional information about those frequencies as a secondary feature string. It compares feature strings using an edit distance tree, which reduces the number of pairwise comparisons and improves efficiency over traditional algorithms.
(5) The segment-based algorithm splits the body text into segments, extracts the longest sentence of each segment as its feature string, and applies a hash algorithm for deduplication. The algorithm achieves high accuracy and very good efficiency.
(6) Finally, the two algorithms are rigorously compared with a punctuation-based deduplication algorithm in terms of efficiency, accuracy, and recall, and the strengths and weaknesses of the three algorithms are analyzed.
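The segment-based algorithm described in the abstract (longest sentence per segment, then hashing) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual segmentation rule, hash function, or decision threshold, so everything concrete here is an assumption made for illustration: segments are taken to be lines of the body text, sentences are split on common Chinese and Western terminal punctuation, SHA-256 stands in for the unspecified hash algorithm, and the names `segment_features`, `is_duplicate`, and the `threshold` parameter are hypothetical.

```python
import hashlib
import re

# Assumption: a sentence ends at one of these terminal punctuation marks.
_SENTENCE_END = re.compile(r"[。！？!?.]")


def segment_features(text):
    """Return the set of hashes of the longest sentence in each segment.

    Assumption: one segment per non-empty line of the body text.
    """
    features = set()
    for segment in text.splitlines():
        sentences = [s.strip() for s in _SENTENCE_END.split(segment)]
        sentences = [s for s in sentences if s]
        if not sentences:
            continue  # blank or punctuation-only segment
        longest = max(sentences, key=len)
        # SHA-256 is a stand-in; the thesis only says "a hash algorithm".
        features.add(hashlib.sha256(longest.encode("utf-8")).hexdigest())
    return frozenset(features)


def is_duplicate(text_a, text_b, threshold=0.5):
    """Treat two pages as duplicates when enough segment hashes coincide.

    The overlap is measured against the smaller feature set, so a repost
    with an extra header or footer segment still matches. The threshold
    value is a hypothetical default, not a figure from the thesis.
    """
    fa, fb = segment_features(text_a), segment_features(text_b)
    if not fa or not fb:
        return False
    overlap = len(fa & fb) / min(len(fa), len(fb))
    return overlap >= threshold
```

Because only hashes are compared, each page can be reduced to its feature set once and matched through set or dictionary lookups rather than by comparing full texts pairwise, which is consistent with the efficiency the abstract reports for this approach. Note that splitting on "." also breaks decimal numbers; a production splitter would need to handle such cases.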
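[Degree-granting institution]: University of Electronic Science and Technology of China (电子科技大学)
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP393.092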
