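Research and Implementation of Web Page Deduplication in Search Engine Systems
Keywords: research and implementation of web page deduplication in search engine systems. Compiled and published by 笔耕文化传播.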
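South-Central University for Nationalities (中南民族大学)
Master's Thesis
Research and Implementation of Web Page Deduplication in Search Engine Systems
Name: Fan Xiaoyuan (范小源)
Degree applied for: Master's
Major: Computer Application Technology
Advisor: Lu Jiguang (陆际光)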
20070520
Abstract
The rapid popularization and development of the Internet has left people facing a sea of
information, and it has become essential to extract the truly important information from it. The
search engine, which here mainly refers to the full-text search system, is a tool that
provides this function. However, the retrieval results of a search engine contain
a large number of duplicated web pages, most of which come from content being reproduced
across websites. These repeated web pages not only occupy network
bandwidth but also waste storage resources. Users do not want to see a pile of search
results with the same or nearly identical content, and truly useful results are often
drowned in this redundant information and cannot easily be found. Effectively
removing these duplicate web pages improves search accuracy and saves
users time and energy, while the search system itself saves a lot of storage
resources and works more efficiently.

This paper mainly studies the problem of removing duplicated web pages for
search engines. At present there are still few effective methods for removing duplicated web pages,
and most of them are implemented on the server side, meaning that duplicated web
pages are removed while web pages are being collected. The methods in common
use today are the method based on identical URLs, the method based on clustering,
the method based on feature codes, and the method based on signatures. In the
clustering-based method, a text is represented as a vector in a vector space model, and various
methods are then used to cluster or classify the texts. In this method, calculating
the angle between vectors has high computational complexity, which will take up more
proce
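
The vector comparison described above for the clustering-based method can be illustrated with a minimal sketch. It is not the thesis's implementation: the tokenizer, the raw term-frequency weighting, the function names, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map a text to a sparse term-frequency vector (word -> count).

    Assumption: simple lowercase word tokenization; a real system would
    likely use a language-specific tokenizer and term weighting.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(text_a, text_b, threshold=0.9):
    """Treat two texts as duplicates when their vectors point almost the same way.

    The threshold is a hypothetical value, not one taken from the thesis.
    """
    return cosine_similarity(term_vector(text_a), term_vector(text_b)) >= threshold
```

Each comparison touches every shared term, and comparing every pair among n pages needs n(n-1)/2 such comparisons, which is one way to see why this approach becomes expensive on a large crawl.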
Article ID: 159480
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/159480.html