Research on a DHT-Based Distributed Price Search Engine

Published: 2019-01-26 21:01

[Abstract]: In recent years, as network resources have diversified and demand for domain-specific information has grown, vertical search engines have attracted increasing research attention. Price-oriented search is one kind of vertical search. However, nearly all existing price search engines are centralized: when a large number of users issue requests at the same time, the central server becomes a bottleneck and is prone to a single point of failure. As networks continue to grow, research on distributed vertical search becomes increasingly important. This thesis combines P2P technology with vertical search and designs a distributed price search engine based on a distributed hash table (DHT). It discusses the crawling strategy of the topic-focused crawler, the use of URL rules to judge the topical relevance of web pages, and the use of XPath to extract web information. It then discusses how the DHT idea can be used to build the index and store it in a distributed manner, which effectively avoids the problems a centralized index may encounter. Finally, to address the unclear and disorganized presentation of search results in existing price search engines, the thesis proposes clustering the search results. Based on a study and analysis of existing clustering algorithms, it improves the k-means algorithm and uses the improved algorithm to cluster the search results, so that documents within a cluster are highly similar and documents in different clusters are less similar. Each cluster is then described by a class label, so users only need to follow the labels to the information they are interested in rather than browse every returned result one by one, which greatly reduces browsing and lookup time.
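The abstract does not give the details of the index layout, so the following is only a minimal sketch of the general DHT idea it refers to: each index term is hashed onto an identifier ring, and the peer whose identifier follows the term's hash stores that term's posting list. The class `ToyDHTIndex`, the helper `ring_id`, the peer names, and the 32-bit ring size are all hypothetical choices made for illustration, and real DHT routing (for example, finger tables and node churn) is omitted.

```python
import hashlib
from bisect import bisect_left


def ring_id(key: str, bits: int = 32) -> int:
    """Map a string to a position on a 2**bits identifier ring (assumed size)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class ToyDHTIndex:
    """Hypothetical in-memory stand-in for a DHT-partitioned inverted index."""

    def __init__(self, node_names):
        # Place every peer on the ring, sorted by its hashed identifier.
        self.ring = sorted((ring_id(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        # Each peer keeps only the posting lists for the terms it owns.
        self.store = {name: {} for name in node_names}

    def responsible_node(self, term: str) -> str:
        # The successor of the term's hash on the ring owns the term;
        # the modulo wraps around past the largest node identifier.
        pos = bisect_left(self.ids, ring_id(term)) % len(self.ring)
        return self.ring[pos][1]

    def publish(self, term: str, doc_url: str) -> None:
        node = self.responsible_node(term)
        self.store[node].setdefault(term, set()).add(doc_url)

    def lookup(self, term: str) -> set:
        node = self.responsible_node(term)
        return self.store[node].get(term, set())
```

In this sketch a query for a term contacts only the one peer that owns it, so no single server holds the whole index, which is the property the abstract attributes to the distributed design.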
[Degree-granting institution]: Xihua University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3
