Research and Implementation of a Nutch-Based Clustering Search Engine

Published: 2018-11-11 14:14

[Abstract]: With the rapid growth of the Internet, the volume of online information is increasing exponentially. Faced with this mass of information, obtaining what one needs quickly and accurately is perhaps the most pressing need of every Internet user. Search engines such as Google, Baidu, and Yahoo emerged in response and opened a path for users to access information. Traditional search engines are far from perfect, however: they present results as a linear list, which makes it difficult for users to find information quickly and accurately. Researchers have therefore introduced text clustering into the analysis of search engine results to help users find what they are looking for. This thesis focuses on improving clustering quality and the computational efficiency of clustering algorithms. It studies the key techniques of Chinese clustering algorithms from four aspects: non-negative matrix factorization, the vector space model, suffix array sorting, and the Chinese word segmentation module. Taking the Lingo clustering algorithm as a prototype, it proposes Rlingo, a Chinese clustering algorithm for clustering small and medium-sized document sets. The main contributions are as follows. First, non-negative matrix factorization based on the Itakura-Saito divergence is introduced into clustering analysis for the first time, which improves the readability of cluster labels and the overall quality of the clustering results. Second, the traditional vector space model is improved by incorporating position and part-of-speech factors, which further improves the quality of the clustering results. Third, building on skew, a linear-time suffix array sorting algorithm, an improved skew suffix array sorting algorithm is proposed that removes the interference of feature words with no real meaning on feature extraction quality, reducing the time the clustering algorithm needs to process small and medium-sized document sets. Fourth, a tourism-oriented clustering system based on Nutch is implemented using Rlingo, and its performance largely meets expectations. Finally, controlled experiments compare the overall performance of Rlingo, Lingo, K-means, and STC. The results show that Rlingo produces clearly better clustering results on small and medium-sized document sets than the other three algorithms, and that the improved algorithm largely achieves the expected results.
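The third contribution in the abstract (suffix array sorting that keeps meaningless words from polluting feature extraction) can be illustrated with a minimal sketch. It is not the thesis's improved skew algorithm: it uses a naive comparison sort instead of the linear-time skew construction, and the stopword filter, the function names, and the `min_len` parameter are assumptions made for illustration only. It shows how a suffix array over segmented tokens exposes repeated phrases, which are the kind of candidate cluster labels that Lingo-style algorithms extract.

```python
from typing import List, Sequence, Set, Tuple


def build_suffix_array(tokens: Sequence[str], stopwords: Set[str]) -> List[int]:
    """Return suffix start positions in lexicographic order of their tokens.

    Suffixes that begin with a stopword are skipped (an assumed filter),
    so no candidate phrase can start with a meaningless word.
    This is a naive O(n^2 log n) sort; a linear-time builder such as
    skew could replace it without changing the result.
    """
    starts = [i for i, tok in enumerate(tokens) if tok not in stopwords]
    return sorted(starts, key=lambda i: list(tokens[i:]))


def common_prefix_len(tokens: Sequence[str], a: int, b: int,
                      stopwords: Set[str]) -> int:
    """Length of the shared token prefix of suffixes a and b.

    The prefix is cut at the first stopword so phrases never span one.
    """
    n = 0
    while (a + n < len(tokens) and b + n < len(tokens)
           and tokens[a + n] == tokens[b + n]
           and tokens[a + n] not in stopwords):
        n += 1
    return n


def repeated_phrases(tokens: Sequence[str], stopwords: Set[str],
                     min_len: int = 2) -> Set[Tuple[str, ...]]:
    """Collect phrases of at least min_len tokens that occur twice or more.

    Only the longest prefix shared by each pair of adjacent suffixes in
    the suffix array is recorded, not every shorter sub-phrase.
    """
    sa = build_suffix_array(tokens, stopwords)
    phrases: Set[Tuple[str, ...]] = set()
    for prev, cur in zip(sa, sa[1:]):
        k = common_prefix_len(tokens, prev, cur, stopwords)
        if k >= min_len:
            phrases.add(tuple(tokens[cur:cur + k]))
    return phrases
```

For the token sequence `guilin travel guide and guilin travel tips` with `and` treated as a stopword, `repeated_phrases` returns the single phrase `("guilin", "travel")`. For Chinese text the tokens would come from a word segmenter, which this sketch does not include.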
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP391.3;TP311.13

[References]