
Research on Web Page Deduplication Techniques for a Campus Network Search Engine

Posted: 2019-06-15 20:14
【Abstract】: With the rapid development of campus network construction, the information resources on campus networks are growing quickly, which makes it difficult for teachers and students to locate valuable information promptly: searches waste time and are inefficient. Because of the particular characteristics of a campus network, mature general-purpose search engines are not fully suitable for it, and the large number of reprinted pages produces too many duplicates in retrieval results. By analyzing the characteristics of campus web pages and existing deduplication techniques, and in order to reduce the number of duplicate pages returned by a campus search engine, this thesis applies separate deduplication strategies at indexing time and at real-time retrieval for different types of duplicate pages, and builds a campus network search engine. The main work is as follows:

First, the preparatory steps for deduplication were studied. The causes, definition, and types of web page noise were analyzed, and a content-block merging technique was used to remove noise from the raw page set and extract the main text of each page. Chinese word segmentation techniques were then compared, the Paoding ("庖丁解牛") segmenter was selected, and Nutch was extended by modifying its source code to support Chinese word segmentation.

Second, the index-time deduplication algorithm was studied and improved. After an analysis of existing algorithms, a longest-paragraph-signature algorithm was adopted for completely or partially duplicated pages. Each document is first signed as a whole to remove exact duplicates; the remaining documents are then split into paragraphs, the paragraphs are sorted, and the first N paragraphs are fingerprinted to serve as the document's feature set. When the number of identical paragraphs shared by two documents exceeds a system-defined threshold, the two documents are judged to be duplicates of each other. Extracting only the first N paragraphs and sorting their fingerprints greatly reduces the computational cost. Experiments show that the method achieves high deduplication accuracy.

Third, for duplicates produced when a reprinted page is slightly modified, an optimized Fourier-transform deduplication algorithm is applied at real-time retrieval. The algorithm maps every word of a document to a numeric fingerprint, so each document can be represented as a discrete numeric sequence. A Fourier transform of this sequence yields Fourier coefficients, and comparing the first few coefficients gives an approximate measure of the similarity of two sequences. Experiments show that the optimized Fourier-transform algorithm balances recall and deduplication rate when pages have been modified.

With Nutch as the development platform, the index-time deduplication algorithm was implemented by modifying the Nutch source code, the retrieval-time algorithm was implemented as a Nutch plugin, and a campus network search engine was designed and implemented on top of Nutch; the development process and methods are described in detail. Finally, the proposed deduplication strategy was evaluated experimentally using campus pages crawled with Nutch as the data set. The results show that combining the two algorithms improves both the precision of search results and the deduplication accuracy, and that the resulting campus search engine runs effectively and reliably.
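To make the index-time procedure concrete, the following Python sketch reconstructs the longest-paragraph-signature idea purely from the description in the abstract; it is not the thesis's actual implementation, which lives inside Nutch's indexing code. The MD5 signatures, the choice of sorting paragraphs by length, and the default values of N and the threshold are assumptions made for illustration.

```python
import hashlib
import re

def fingerprint(text: str) -> str:
    """Signature of a text block (a whole document or a single paragraph)."""
    return hashlib.md5(text.strip().encode("utf-8")).hexdigest()

def top_n_fingerprints(doc: str, n: int = 5) -> set:
    """Split a document into paragraphs, keep the N longest ones,
    and use their fingerprints as the document's feature set."""
    paragraphs = [p for p in re.split(r"\n+", doc) if p.strip()]
    paragraphs.sort(key=len, reverse=True)   # sort paragraphs (assumed: by length)
    return {fingerprint(p) for p in paragraphs[:n]}

def is_duplicate(doc_a: str, doc_b: str, n: int = 5, threshold: int = 3) -> bool:
    """Judge two documents as duplicates when their whole-document signatures
    match (exact copies) or when the number of shared paragraph fingerprints
    exceeds a system-given threshold (partial copies)."""
    if fingerprint(doc_a) == fingerprint(doc_b):
        return True
    shared = top_n_fingerprints(doc_a, n) & top_n_fingerprints(doc_b, n)
    return len(shared) > threshold
```

Keeping only the N longest paragraphs gives every document a fixed-size feature set, so the pairwise comparison cost does not grow with document length, which matches the abstract's claim that this step greatly reduces computational complexity.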
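The retrieval-time comparison can likewise be sketched. The code below only illustrates the Fourier-transform idea as stated in the abstract: CRC32 is used as a stand-in word-to-number mapping, the magnitudes of the first k coefficients form the signature, and the normalized-distance similarity is an assumed comparison rule; the thesis's optimized variant and its actual parameters are not reproduced here.

```python
import zlib
import numpy as np

def doc_to_sequence(words: list[str]) -> np.ndarray:
    """Map every word to a numeric fingerprint so the document becomes a
    discrete numeric sequence (CRC32 is an illustrative stand-in)."""
    return np.array([zlib.crc32(w.encode("utf-8")) & 0xFFFF for w in words],
                    dtype=float)

def fourier_signature(words: list[str], k: int = 8) -> np.ndarray:
    """Discrete Fourier transform of the word sequence; only the first k
    coefficient magnitudes are kept as the document signature."""
    coeffs = np.fft.fft(doc_to_sequence(words))
    return np.abs(coeffs[:k])

def fourier_similarity(words_a: list[str], words_b: list[str], k: int = 8) -> float:
    """Rough similarity of two documents: 1 minus the normalized distance
    between the leading Fourier coefficients of their word sequences."""
    a, b = fourier_signature(words_a, k), fourier_signature(words_b, k)
    denom = np.linalg.norm(a) + np.linalg.norm(b)
    return 1.0 - np.linalg.norm(a - b) / denom if denom else 1.0
```

Two retrieval results would then be treated as near-duplicates when this similarity exceeds a threshold. Because a small edit to a reprinted page changes only a few terms of the sequence, the leading Fourier coefficients shift only slightly, which is what allows modified copies to be recognized without an exact match.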
【Degree-granting institution】: 内蒙古科技大学 (Inner Mongolia University of Science and Technology)
【Degree level】: Master's
【Year conferred】: 2012
【CLC number】: TP391.3
