A Metric-Space Outlier Detection Algorithm Based on Multiple Types of Pivots

Published: 2018-05-06 11:23

Topics: outlier detection + metric space; Source: Chinese Journal of Computers (《计算机学报》), 2017, No. 12

[Abstract]: The value of big data ultimately depends on data mining technology, and in many fields the unconventional patterns in massive data are often the most valuable to analyze. Outlier detection, also called anomaly detection, is a key technique for mining such patterns and is widely used in network intrusion detection, public health, medical monitoring, and other fields. Index-based outlier detection algorithms usually detect quickly, but most existing ones are not purely distance-based, which reduces their generality. Because of its higher level of abstraction, a metric space applies more widely than a multidimensional space, and algorithms designed on it are more general. iORCA, the most recent index-based outlier detection algorithm for metric spaces, selects a pivot (support point) at random, builds an index from the distances between the data and that single pivot, and applies a stopping rule in the hope of ending detection early while still returning correct results; in most cases this mechanism plays an important role in speeding up detection. However, iORCA provides no pivot selection algorithm, which makes its detection results unstable, and it does not fully exploit the triangle inequality to reduce the number of distance computations.
To address these problems, the paper argues that a distance-based outlier definition should be paired with a purely distance-based detection algorithm to ensure generality, and on that basis proposes the concept of metric-space outlier detection. It then identifies two goals of pivot selection, namely edge pivots and dense pivots, and proposes VPOD, a metric-space outlier detection algorithm based on multiple types of pivots. Because the two selection goals are hard to achieve at the same time, VPOD handles them separately. It selects a pivot in an approximately dense region (the dense pivot), which is used with the stopping rule, and then uses the FFT (Farthest-First Traversal) algorithm to select several further pivots (the edge pivots), whose distances to the data set form the pivot space. The triangle inequality then significantly reduces the number of distance computations, which speeds up detection. Experiments show that the algorithm builds its index within an acceptable time and detects outliers efficiently: the speedup reaches 2.05, with a maximum of 3.54; the number of distance computations falls by 51.14% on average and by up to 89.46%; and the algorithm remains compatible with several common distance-based outlier definitions.
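The two building blocks named in the abstract, farthest-first traversal for choosing edge pivots (support points) and triangle-inequality pruning over a pivot space, can be illustrated with a minimal Python sketch. This is not the VPOD algorithm itself: the function names, the Euclidean metric, and the fixed-radius neighbor count are all assumptions made for illustration, and the dense pivot and stopping rule are not modeled.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance; a stand-in for any metric (assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fft_pivots(data, num_pivots, metric=euclid, seed=0):
    """Farthest-first traversal: start from a random point, then
    repeatedly add the point farthest from all pivots chosen so far."""
    rng = random.Random(seed)
    first = rng.randrange(len(data))
    pivots = [first]
    # min_d[i] = distance from data[i] to its nearest chosen pivot
    min_d = [metric(data[first], x) for x in data]
    while len(pivots) < min(num_pivots, len(data)):
        nxt = max(range(len(data)), key=min_d.__getitem__)
        if min_d[nxt] == 0:
            break  # every remaining point coincides with a pivot
        pivots.append(nxt)
        for i, x in enumerate(data):
            min_d[i] = min(min_d[i], metric(data[nxt], x))
    return pivots


def pivot_table(data, pivots, metric=euclid):
    """Map every object to its vector of distances to the pivots."""
    return [[metric(x, data[p]) for p in pivots] for x in data]


def lower_bound(row_a, row_b):
    """Triangle inequality: max_p |d(a,p) - d(b,p)| <= d(a,b)."""
    return max(abs(u - v) for u, v in zip(row_a, row_b))


def count_neighbors(i, data, table, radius, metric=euclid):
    """Count objects within `radius` of data[i].
    Returns (neighbor count, number of real distance computations)."""
    count = computed = 0
    for j in range(len(data)):
        if j == i:
            continue
        if lower_bound(table[i], table[j]) > radius:
            continue  # provably farther than radius; skip the metric call
        computed += 1
        if metric(data[i], data[j]) <= radius:
            count += 1
    return count, computed


data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0)]
pivots = fft_pivots(data, num_pivots=2)
table = pivot_table(data, pivots)
print(count_neighbors(3, data, table, radius=1.0))
```

In this toy data set the isolated point `(10.0, 10.0)` is always chosen as one of the two pivots, so every comparison involving it is pruned by the lower bound without calling the metric. A point with too few neighbors within the radius would be reported as an outlier under a simple distance-based definition.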
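[Author affiliations]: School of Mathematics and Big Data, Foshan University; Guangdong Province Key Laboratory of Popular High Performance Computers, College of Computer Science and Software Engineering, Shenzhen University; College of Chemistry, Nankai University
[Funding]: Supported by the National High Technology Research and Development Program of China (863 Program) (2015AA015305), the National Natural Science Foundation of China-Guangdong Joint Fund (U1301252, U1501254), the Guangdong Provincial Key Laboratory Construction Assessment Project (2017B030314073), the Natural Science Foundation of Guangdong Province (2015A030313636), and the Shenzhen Science and Technology Program (CXZZ20140418182638764).
[CLC number]: TP311.13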

[Related literature]