
Research on Web Page Deduplication Techniques for a Campus Network Search Engine

Posted: 2019-06-15 20:14
【Abstract】: With the rapid development of campus network construction, the information resources on campus networks are growing quickly, which makes it difficult for teachers and students to locate valuable information promptly: searches waste time and are inefficient. Because of the particular characteristics of a campus network, mature general-purpose search engines are not fully suitable for it, and the large number of reprinted pages produces too many duplicates in retrieval results. By analyzing the characteristics of campus web pages and existing deduplication techniques, and in order to reduce the number of duplicate pages returned by a campus search engine, this thesis applies separate deduplication strategies at indexing time and at real-time retrieval for different types of duplicate pages, and builds a campus network search engine. The main work is as follows:

First, the preparatory steps for deduplication were studied. The causes, definition, and types of web page noise were analyzed, and a content-block merging technique was used to remove noise from the raw page set and extract the main text of each page. Chinese word segmentation techniques were then compared, the Paoding ("庖丁解牛") segmenter was selected, and Nutch was extended by modifying its source code to support Chinese word segmentation.

Second, the index-time deduplication algorithm was studied and improved. After an analysis of existing algorithms, a longest-paragraph-signature algorithm was adopted for completely or partially duplicated pages. Each document is first signed as a whole to remove exact duplicates; the remaining documents are then split into paragraphs, the paragraphs are sorted, and the first N paragraphs are fingerprinted to serve as the document's feature set. When the number of identical paragraphs shared by two documents exceeds a system-defined threshold, the two documents are judged to be duplicates of each other. Extracting only the first N paragraphs and sorting their fingerprints greatly reduces the computational cost. Experiments show that the method achieves high deduplication accuracy.

Third, for duplicates produced when a reprinted page is slightly modified, an optimized Fourier-transform deduplication algorithm is applied at real-time retrieval. The algorithm maps every word of a document to a numeric fingerprint, so each document can be represented as a discrete numeric sequence. A Fourier transform of this sequence yields Fourier coefficients, and comparing the first few coefficients gives an approximate measure of the similarity of two sequences. Experiments show that the optimized Fourier-transform algorithm balances recall and deduplication rate when pages have been modified.

With Nutch as the development platform, the index-time deduplication algorithm was implemented by modifying the Nutch source code, the retrieval-time algorithm was implemented as a Nutch plugin, and a campus network search engine was designed and implemented on top of Nutch; the development process and methods are described in detail. Finally, the proposed deduplication strategy was evaluated experimentally using campus pages crawled with Nutch as the data set. The results show that combining the two algorithms improves both the precision of search results and the deduplication accuracy, and that the resulting campus search engine runs effectively and reliably.
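To make the index-time procedure concrete, the following Python sketch reconstructs the longest-paragraph-signature idea purely from the description in the abstract; it is not the thesis's actual implementation, which lives inside Nutch's indexing code. The MD5 signatures, the choice of sorting paragraphs by length, and the default values of N and the threshold are assumptions made for illustration.

```python
import hashlib
import re

def fingerprint(text: str) -> str:
    """Signature of a text block (a whole document or a single paragraph)."""
    return hashlib.md5(text.strip().encode("utf-8")).hexdigest()

def top_n_fingerprints(doc: str, n: int = 5) -> set:
    """Split a document into paragraphs, keep the N longest ones,
    and use their fingerprints as the document's feature set."""
    paragraphs = [p for p in re.split(r"\n+", doc) if p.strip()]
    paragraphs.sort(key=len, reverse=True)   # sort paragraphs (assumed: by length)
    return {fingerprint(p) for p in paragraphs[:n]}

def is_duplicate(doc_a: str, doc_b: str, n: int = 5, threshold: int = 3) -> bool:
    """Judge two documents as duplicates when their whole-document signatures
    match (exact copies) or when the number of shared paragraph fingerprints
    exceeds a system-given threshold (partial copies)."""
    if fingerprint(doc_a) == fingerprint(doc_b):
        return True
    shared = top_n_fingerprints(doc_a, n) & top_n_fingerprints(doc_b, n)
    return len(shared) > threshold
```

Keeping only the N longest paragraphs gives every document a fixed-size feature set, so the pairwise comparison cost does not grow with document length, which matches the abstract's claim that this step greatly reduces computational complexity.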
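The retrieval-time comparison can likewise be sketched. The code below only illustrates the Fourier-transform idea as stated in the abstract: CRC32 is used as a stand-in word-to-number mapping, the magnitudes of the first k coefficients form the signature, and the normalized-distance similarity is an assumed comparison rule; the thesis's optimized variant and its actual parameters are not reproduced here.

```python
import zlib
import numpy as np

def doc_to_sequence(words: list[str]) -> np.ndarray:
    """Map every word to a numeric fingerprint so the document becomes a
    discrete numeric sequence (CRC32 is an illustrative stand-in)."""
    return np.array([zlib.crc32(w.encode("utf-8")) & 0xFFFF for w in words],
                    dtype=float)

def fourier_signature(words: list[str], k: int = 8) -> np.ndarray:
    """Discrete Fourier transform of the word sequence; only the first k
    coefficient magnitudes are kept as the document signature."""
    coeffs = np.fft.fft(doc_to_sequence(words))
    return np.abs(coeffs[:k])

def fourier_similarity(words_a: list[str], words_b: list[str], k: int = 8) -> float:
    """Rough similarity of two documents: 1 minus the normalized distance
    between the leading Fourier coefficients of their word sequences."""
    a, b = fourier_signature(words_a, k), fourier_signature(words_b, k)
    denom = np.linalg.norm(a) + np.linalg.norm(b)
    return 1.0 - np.linalg.norm(a - b) / denom if denom else 1.0
```

Two retrieval results would then be treated as near-duplicates when this similarity exceeds a threshold. Because a small edit to a reprinted page changes only a few terms of the sequence, the leading Fourier coefficients shift only slightly, which is what allows modified copies to be recognized without an exact match.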
【Degree-granting institution】: 内蒙古科技大学 (Inner Mongolia University of Science and Technology)
【Degree level】: Master's
【Year conferred】: 2012
【CLC number】: TP391.3
