Research on Web Structure Mining and High-Dimensional Data Mining

Published: 2018-11-12 18:56

[Abstract]: Data mining is a frontier research direction in artificial intelligence, machine learning, pattern recognition, and decision support. With the rapid growth of the Web and improved data acquisition capabilities, Web mining and high-dimensional data mining have become two important data mining tasks.

The Web is the most important platform through which people in modern society disseminate and obtain information. It already contains on the order of a billion pages and keeps growing daily, and the amount of information it holds is growing explosively. Because information on the Web is unstructured and self-organized, traditional information retrieval techniques are difficult to apply effectively in practice. Besides pages, the Web contains a large number of hyperlinks, which carry implicit evaluations of page importance. Web structure mining (i.e., Web link analysis) has therefore become the most important way to improve the quality of Web information retrieval.

Cluster analysis is one of the basic methods of data mining and is widely used in many fields. In recent years, the data in many clustering problems have been high-dimensional. The classical clustering methods, however, assume a low-dimensional data space and cannot cluster high-dimensional data effectively, so high-dimensional data clustering has become a focus of current cluster analysis research. Manifold clustering is a high-dimensional clustering approach that has been developed and widely studied in recent years.

Addressing two typical data mining problems, Web structure mining and high-dimensional data clustering, this dissertation studies link-analysis-based page ranking algorithms for search engines, Web community discovery algorithms, effective dissimilarity measures for manifold clustering, and low-rank approximation for manifold clustering of large-scale high-dimensional data. The main contributions are:

(1) The characteristics of the link-analysis-based page ranking algorithms PageRank and HITS are analyzed. A PageRank framework based on a multilevel decay model is proposed, in which the decay model assigns the weights of direct and indirect links between pages, improving query precision. An improved HITS algorithm based on page similarity and link popularity is also proposed; it assigns link weights according to the similarity between pages with respect to the query topic and the popularity of the links between pages, which effectively alleviates the topic drift problem of HITS.

(2) The relationship between edge capacity and community size in maximum-flow-based community discovery is studied in depth, and the characteristics of the link structure are analyzed from the perspective of community discovery. A method is proposed that assigns edge capacities using the probability distributions of page in-degree and out-degree, which reduces the likelihood of extracting noise pages and improves the quality of the discovered Web communities.

(3) An effective dissimilarity based on neighborhood paths is proposed. It strengthens the class characteristics of the low-dimensional representation obtained by manifold learning and improves the results of clustering through manifold learning. The relationship between the sampled points and the accuracy with which the Nystrom extension approximates the eigenvectors of a large-scale kernel matrix is analyzed, and an incremental sampling strategy is proposed on that basis, improving clustering quality when the Nystrom extension is used to accelerate manifold clustering.
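As background for contribution (1), the following is a minimal sketch of standard PageRank computed by power iteration. It is not the dissertation's multilevel decay model, which the abstract does not specify. The function name `pagerank`, the example graph `toy_web`, the damping factor 0.85, and the uniform handling of pages without out-links are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Standard PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Rank held by pages with no out-links is spread uniformly (assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Each page passes its rank in equal shares along its out-links.
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new[q] += damping * rank[p] / len(outs)

        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical four-page web: C is linked to by A, B, and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph, C receives the most in-links and ends up with the highest rank, while D, which no page links to, keeps only the baseline share. This illustrates how hyperlinks act as implicit importance votes. Every direct link here carries equal weight; according to the abstract, the dissertation's variants change how link weights are assigned.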
[Degree-granting institution]: Dalian University of Technology
[Degree level]: Doctoral
[Year degree granted]: 2012
[Classification number]: TP311.13
