Research on Technologies Related to Web Search Engines

Published: 2018-03-22 03:34

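Topic: search engines　Focus: index construction　Source: Shandong University of Science and Technology, master's thesis, 2011　Document type: degree thesis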

[Abstract]: A web search engine (such as Google or Baidu) is a special kind of information retrieval system: what sets it apart is that its retrieval scope covers the page resources of the entire Web. Information resources on the Internet are vast in number and constantly changing. Most importantly, web page documents are themselves semi-structured or unstructured and often contain content unrelated to the page's topic, such as navigation, advertising, and useless links, which makes them far more complex than ordinary text documents. Most general information retrieval systems (such as literature retrieval systems), by contrast, are designed around the vector space model and cannot accommodate these characteristics of Web resources. As a result, web search engines differ substantially in their working principles from retrieval systems based on the vector space model. This thesis examines those differences from three angles: index construction, query expansion, and the ranking of relevant pages.
The main contents are as follows. The thesis describes in detail how a web search engine's index is organized. To address the problem that web pages contain large amounts of irrelevant information, such as advertisements and navigation, that reduces indexing efficiency, it gives algorithms for web page preprocessing and text extraction that remove duplicate pages, noise content, and noise links from web page documents, improving the search engine's indexing efficiency. It proposes an algorithm for related search that combines user interest with server-side log mining. Because the traditional PageRank algorithm suffers from "topic drift", which brings in a great deal of noise unrelated to what the user needs, it proposes a PageRank algorithm based on page topic relevance. This algorithm judges the relevance of a web page document to the query topic from three aspects (the page's hyperlinks, its content, and user click behavior) and thereby avoids returning too many pages unrelated to the search topic. Finally, it proposes an automatic summarization algorithm: it computes a weight for each sentence in a web page document and returns the sentences that best express the page's topic as a summary, so that users can grasp the document's topic quickly and intuitively, refine their search keywords step by step, and retrieve the pages they need.
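
As a rough illustration of the topic-relevance idea described above, the sketch below biases a standard PageRank iteration with a per-page relevance score. It is not the thesis's algorithm, whose details the abstract does not give. The function name `topic_biased_pagerank`, the `relevance` scores (a stand-in for whatever combination of hyperlink, content, and click signals the thesis uses), and the damping value 0.85 are all assumptions made for this example.

```python
def topic_biased_pagerank(links, relevance, damping=0.85, iterations=50):
    """Return a rank per page, biased toward pages with high relevance.

    links:     dict mapping a page to the list of pages it links to
    relevance: dict mapping a page to a non-negative topic-relevance score
               (hypothetical input; missing pages count as 0.0)
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})

    # Teleport distribution: proportional to relevance, uniform as a fallback.
    total_rel = sum(relevance.get(p, 0.0) for p in pages)
    if total_rel > 0:
        teleport = {p: relevance.get(p, 0.0) / total_rel for p in pages}
    else:
        teleport = {p: 1.0 / len(pages) for p in pages}

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * teleport[p] for p in pages}
        for p in pages:
            outs = links.get(p, [])
            weights = [relevance.get(q, 0.0) for q in outs]
            weight_sum = sum(weights)
            if weight_sum == 0:
                # No outlinks, or every target is off-topic:
                # hand this page's rank back to the teleport distribution.
                for q in pages:
                    new_rank[q] += damping * rank[p] * teleport[q]
            else:
                # Split rank among outlinks in proportion to target relevance,
                # so off-topic targets receive less than on-topic ones.
                for q, w in zip(outs, weights):
                    new_rank[q] += damping * rank[p] * w / weight_sum
        rank = new_rank
    return rank


if __name__ == "__main__":
    demo_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    demo_relevance = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 0.0}
    ranks = topic_biased_pagerank(demo_links, demo_relevance)
    for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(page, round(score, 4))
```

Using relevance both for the teleport step and for splitting rank across outlinks is one design choice among several. It keeps the ranks summing to 1 while limiting how much rank flows to off-topic pages, which is the effect the abstract attributes to reducing "topic drift".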

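[Degree-granting institution]: Shandong University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2011
[Classification number]: G354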

[References]