自然邻居思想概念及其在数据挖掘领域的应用

发布时间：2018-07-28 14:44

【摘要】：近些年来,最近邻居的概念与对应的邻居搜索算法在数据挖掘、图像处理以及模式识别等多个领域中有着广泛的应用基础并取得了许多令人满意的结果。最近邻居概念中近邻关系的定义是最近邻居思想的根本基础,对该思想的各种方法起着决定性的作用。在众多的近邻关系中,最为广泛应用的无疑是k-最近邻居(KNN:k-Nearest Neighbor)和逆k最近邻居(RkNN:Reverse k-Nearest Neighbor)。然而,无论是KNN还是Rk NN,最近邻居的概念中始终萦绕着一个悬而未决的问题——如何选择合适的邻域大小。邻域参数的最优选择并不固定,其取值通常依赖数据集自身的分布情况。比如在对数据集进行分类操作时,较大的邻域参数能够减少噪声点对分类结果的影响,但同时会模糊不同类之间的差异性。特别是当数据集呈现流形分布时,过大的邻域参数会造成短路的现象,对流形产生破坏。而与之相反,过小的邻域参数会削弱数据间的邻居关系,极端情况下则会将属于同一个类的数据点分割到多个不同的区域中。因此,在当前基于最近邻居思想的算法中,邻域选择问题成为了制约算法效率的重要部分。为了从根本上解决这个问题,本文提出了自然邻居思想及其应用。首先,论文提出了自然邻居思想的基本概念。自然邻居思想摆脱了邻域参数选择的难题,在自然邻居的查找过程中自适应的完成邻居关系的构建,同时获得具有数据集特征信息的自然邻居特征值和自然邻居邻域图。自然邻居思想的主要特点为:1)自然邻居思想能根据不同数据集的局部特征创建对应的自然邻居邻域图,其能够直观准确地呈现数据分布规律,特别是流形数据和噪声数据。2)自然邻居思想能够对不同数据集自适应得到自然邻居特征值,而自然邻居特征值能够动态的反映不同数据集的分布状态。3)自然邻居思想中每个数据点的邻居数量是可变的,邻居的多少反映了数据点与数据集的真实关系。在自然邻居思想的概念之上,论文提出了自然邻居思想对传统算法中邻域参数选择问题的解决办法。自然邻居思想中的自然邻居特征值反映了数据集的分布情况,因此其可以作为传统最近邻居思想中的邻域参数k。基于该思路,论文提出了自然邻居特征值的快速计算算法,高效地计算自然邻居特征值,进而将其作为邻域参数应用于离群检测、聚类分析等领域的多个算法中,并且取得了令人满意的实验结果。除了自然邻居特征值之外,反映自然邻居关系的自然邻居邻域图也具有极强的研究价值。论文在最后提出了一种基于加权自然邻居邻域图的数据挖掘算法,将自然邻居查找过程中的查找深度作为自然邻居邻域图中边的权值构造加权自然邻居邻域图,在其基础上能够对任意分布的数据集一次性地进行聚类分析、离群挖掘和数据可视化分析。
[Abstract]:In recent years, the concept of nearest neighbor and the corresponding neighbor search algorithm have been widely used in many fields, such as data mining, image processing and pattern recognition, and many satisfactory results have been obtained. The definition of nearest neighbor in the concept of nearest neighbor is the fundamental basis of the thought of nearest neighbor, which plays a decisive role in various methods of this idea. Among the many nearest neighbor relationships, the most widely used ones are undoubtedly k-nearest neighbor (KNN:k-Nearest Neighbor) and inverse k-nearest neighbor (RkNN:Reverse k-Nearest Neighbor). However, whether KNN or Rk NN, the concept of nearest neighbor is always haunted by the question of how to choose the appropriate neighborhood size. The optimal selection of neighborhood parameters is not fixed, and its value usually depends on the distribution of the data set itself. For example, when classifying data sets, larger neighborhood parameters can reduce the influence of noise points on classification results, but at the same time, the differences between different classes will be blurred. Especially when the data set presents manifold distribution, excessive neighborhood parameters will cause short-circuit and convection destruction. On the other hand, too small neighborhood parameters will weaken the neighbor relationship between the data, and in extreme cases, the data points belonging to the same class will be divided into several different regions. Therefore, in the current algorithm based on nearest neighbor, neighborhood selection problem has become an important part of the algorithm efficiency constraints. In order to solve this problem fundamentally, this paper puts forward the idea of natural neighbor and its application. Firstly, the paper puts forward the basic concept of the idea of natural neighbor. The idea of natural neighbor gets rid of the difficult problem of neighborhood parameter selection and adaptively completes the construction of neighbor relationship in the process of natural neighbor search. At the same time the natural neighbor characteristic value and natural neighbor neighborhood graph with feature information of data set are obtained at the same time. The main feature of the natural neighbor thought is: (1) the natural neighbor thought can create the corresponding natural neighbor neighborhood graph according to the local characteristics of different data sets, and it can present the data distribution law intuitively and accurately. In particular, manifold data and noise data .2) the idea of natural neighbor can adaptively obtain the eigenvalues of natural neighbors for different data sets. The natural neighbor feature value can dynamically reflect the distributed state of different data sets. 3) the number of neighbors of each data point is variable in the concept of natural neighbor. The number of neighbors reflects the real relationship between data points and data sets. Based on the concept of natural neighbor, this paper proposes a solution to the problem of neighborhood parameter selection in traditional algorithm. The eigenvalue of natural neighbor in natural neighbor thought reflects the distribution of data set, so it can be used as the neighborhood parameter kin the traditional nearest neighbor thought. Based on this idea, this paper proposes a fast algorithm for calculating the eigenvalues of natural neighbors, which can be used as neighborhood parameters in many algorithms in the field of outlier detection, clustering analysis and so on. Satisfactory experimental results have been obtained. Besides the eigenvalue of natural neighbor, the graph of natural neighbor which reflects the relation of natural neighbor is also of great value. At the end of this paper, a data mining algorithm based on weighted natural neighbor graph is proposed. The search depth in the natural neighbor search process is taken as the weight value of the edge of the natural neighbor graph to construct the weighted natural neighbor graph. On the basis of this method, clustering analysis, outlier mining and data visualization can be performed on arbitrarily distributed data sets at one time.
【学位授予单位】：重庆大学
【学位级别】：博士
【学位授予年份】：2016
【分类号】：TP311.13

【参考文献】