Research on Web Page Deduplication Technology in Campus Network Search Engines
[Abstract]: With the rapid build-out of campus networks, campus information resources have grown quickly, making it hard for teachers and students to locate valuable information; searching wastes time and is inefficient. Because of the particular characteristics of campus networks, existing general-purpose search engines are not fully suited to them, and the large number of reprinted web pages leads to too many duplicate pages in retrieval results. To address this, this thesis analyzes the characteristics of campus network web pages and existing deduplication techniques, and builds a campus network search engine that handles different types of duplicate pages with a two-stage strategy: deduplication at index time and at real-time retrieval. The following work was done. First, web page preprocessing is studied and analyzed. The causes, definition, and types of web page noise are analyzed, and a content block merging technique is used to remove noise from the original page set and extract the body text of each page. Existing Chinese word segmentation techniques are then studied and compared. Finally, Nutch is extended through secondary development, that is, its source code is modified to support Chinese word segmentation. Second, index-time web page deduplication algorithms are studied and improved. Existing algorithms are analyzed, and a longest-paragraph signature deduplication algorithm is applied to completely or partially duplicated pages. The whole document is signed first; the filtered document is then split into paragraphs, the paragraphs are sorted, and the first N paragraphs are fingerprinted to serve as the features of the document.
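The longest-paragraph signature idea described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the thesis's implementation: MD5 as the fingerprint function, newline-based paragraph splitting, sorting paragraphs by length (one reading of "longest paragraph"), and the names `fingerprint`, `document_signature`, `is_duplicate`, `n`, and `threshold` are all hypothetical.

```python
import hashlib
from typing import List, Tuple


def fingerprint(text: str) -> str:
    """Return an MD5 hex digest of the text with all whitespace removed."""
    normalized = "".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def document_signature(text: str, n: int = 3) -> Tuple[str, List[str]]:
    """Return the whole-document fingerprint and the sorted
    fingerprints of the n longest non-empty paragraphs."""
    # Assumption: paragraphs are separated by newlines.
    paragraphs = [p for p in text.split("\n") if p.strip()]
    # Assumption: "sorting" means longest paragraph first.
    longest = sorted(paragraphs, key=len, reverse=True)[:n]
    return fingerprint(text), sorted(fingerprint(p) for p in longest)


def is_duplicate(a: str, b: str, n: int = 3, threshold: int = 1) -> bool:
    """Treat two documents as duplicates if their whole-document
    fingerprints match, or if the number of shared paragraph
    fingerprints exceeds the (hypothetical) threshold."""
    doc_a, paras_a = document_signature(a, n)
    doc_b, paras_b = document_signature(b, n)
    if doc_a == doc_b:
        return True  # completely duplicated pages
    shared = len(set(paras_a) & set(paras_b))
    return shared > threshold  # partially duplicated pages
```

Only N fingerprints per document are compared here, so the comparison cost does not grow with document length. Suitable values of `n` and `threshold` would have to be tuned on real data.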
When the number of identical fingerprints in two documents exceeds a system-defined threshold, the two documents are judged to be duplicates. Extracting only the first N paragraphs and sorting their fingerprints greatly reduces the computational cost. Experimental results show that the method achieves high deduplication accuracy. Third, an optimized Fourier transform algorithm is used at real-time retrieval for duplicate pages that were slightly modified when reprinted. The algorithm maps each word of a document to a numerical fingerprint, so that each document can be represented as a discrete numerical sequence. Applying the Fourier transform to this sequence yields its Fourier coefficients, and the similarity of two sequences can be roughly estimated by comparing their first few coefficients. Experimental results show that the optimized Fourier transform algorithm balances precision and recall for modified pages. The system is built on Nutch: the index-time algorithm is implemented by modifying the Nutch source code, and the web page deduplication algorithm is implemented as a plug-in. The campus network search engine is designed and implemented on top of Nutch, and its development process and methods are described in detail. Finally, the proposed deduplication strategy is evaluated experimentally, using campus network pages crawled by Nutch as the data set. The results show that combining the two algorithms improves both the accuracy of search results and the accuracy of deduplication, and that the campus network search engine runs effectively and as expected.
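The Fourier-based comparison described in the abstract can be sketched roughly as below. This is an illustrative sketch, not the optimized algorithm from the thesis: CRC32 as the word-to-number mapping, scaling coefficients by 1/N, using the largest coefficient difference as the distance, and the names `word_fingerprint`, `fourier_coefficients`, `spectral_distance`, and `k` are all assumptions made for the example.

```python
import cmath
import zlib
from typing import List, Sequence


def word_fingerprint(word: str) -> float:
    """Map a word to a number in [0, 1) using CRC32 (an illustrative choice)."""
    return zlib.crc32(word.encode("utf-8")) / 2**32


def fourier_coefficients(sequence: Sequence[float], k: int) -> List[complex]:
    """Return the first k discrete Fourier coefficients of the sequence,
    scaled by 1/N so that documents of different lengths are comparable."""
    n = len(sequence)
    if n == 0:
        return [0j] * k
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
            for t, x in enumerate(sequence)) / n
        for f in range(k)
    ]


def spectral_distance(words_a: Sequence[str], words_b: Sequence[str],
                      k: int = 8) -> float:
    """Largest absolute difference among the first k coefficients
    of the two word sequences; 0.0 means identical spectra."""
    coeffs_a = fourier_coefficients([word_fingerprint(w) for w in words_a], k)
    coeffs_b = fourier_coefficients([word_fingerprint(w) for w in words_b], k)
    return max(abs(a - b) for a, b in zip(coeffs_a, coeffs_b))
```

A small distance suggests that two pages may be near-duplicates. The cutoff below which pages count as duplicates, and the number of coefficients `k`, are tuning choices not specified in the abstract.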
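[Degree-granting institution]: Inner Mongolia University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.3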
Article ID: 2500480
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2500480.html