Research on Robust Non-negative Matrix Factorization Algorithms

Published: 2018-04-02 10:40

Topic: data mining  Focus: non-negative matrix factorization  Source: Beijing Jiaotong University, master's thesis, 2017

[Abstract]: With the growth of the Internet, the era of big data has arrived quietly. Every day, user activity generates hundreds of millions of records, including social, shopping, and browsing information. This data contains many patterns of user behavior that are not normally visible, and these patterns can often yield better economic returns or higher working efficiency. Finding the information that is valuable to oneself in massive data has therefore become a central topic of the big data era, and data mining emerged in response to this pressing need.

Matrix factorization is an important research area in data mining and is widely used in image and text mining. In practice, however, it runs into the problem that image pixel values cannot be negative and that negative values are meaningless in document statistics. If negative values are not handled well, the interpretability of the algorithm drops sharply. Non-negative matrix factorization (NMF) attracted attention as a way to improve interpretability. NMF imposes non-negativity constraints on the basis matrix and the coefficient matrix produced by the factorization. This constraint matches application scenarios in which negative values have no meaning, and so it improves interpretability. NMF also converges quickly and has a small storage footprint, which makes it well suited to processing big data.

Classical NMF, however, handles noisy data poorly: because it squares the errors, it amplifies the influence of noisy data on the result, which limits its use in real scenarios. A later improvement no longer squares the residual of each data point and simply sums the residuals, which reduces the influence of noisy data to some extent. It cannot adapt well to changes in the proportion of noisy data in a data set, though, so it fails to give satisfactory results on some data sets.

To address this problem, this thesis proposes two NMF algorithms: truncated robust NMF and doubly truncated robust NMF. Truncated robust NMF builds on robust NMF based on the L_{2,1} norm and introduces a truncation parameter on the number of data points. The residual computed for each data point is compared against this parameter; residuals that are larger are truncated to zero, and the rest continue to take part in the computation. This removes noisy data points with large errors and reduces their influence on the result, while the truncation parameter lets the algorithm adapt to changes in the proportion of noisy data, which improves robustness. Doubly truncated robust NMF goes a step further. It takes better account of the intrinsic structure of the data by introducing the Ridge Leverage Score to improve the criterion used to identify noisy data, and it adds handling of noisy attributes through a truncation parameter that controls the number of noisy attributes. These improvements raise the accuracy of the results and strengthen robustness, so that the algorithm can adapt to complex real-world scenarios and be applied widely.
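The losses contrasted in the abstract can be illustrated with a small NumPy sketch. This is not the thesis's algorithm: it only evaluates the objectives for fixed factors and contains no update rules. It assumes that columns of X are data points, that truncation means discarding the `n_discard` largest per-point reconstruction errors, and that the ridge leverage score takes its standard form x_i^T (X X^T + lam I)^{-1} x_i. All function and parameter names (`column_residuals`, `n_discard`, `lam`) are hypothetical.

```python
import numpy as np

def column_residuals(X, W, H):
    """Euclidean norm of each data point's reconstruction error (columns = data points)."""
    return np.linalg.norm(X - W @ H, axis=0)

def frobenius_loss(X, W, H):
    """Classical NMF objective: sum of squared per-point errors."""
    return float(np.sum(column_residuals(X, W, H) ** 2))

def l21_loss(X, W, H):
    """L2,1-norm objective: per-point errors are summed without squaring."""
    return float(np.sum(column_residuals(X, W, H)))

def truncated_l21_loss(X, W, H, n_discard):
    """L2,1 objective after zeroing the n_discard largest per-point errors.

    Assumption: the truncation parameter is a count of points to discard.
    """
    r = np.sort(column_residuals(X, W, H))
    n_keep = max(r.size - max(n_discard, 0), 0)
    return float(r[:n_keep].sum())

def ridge_leverage_scores(X, lam):
    """Standard ridge leverage score of each column x_i of X."""
    G = X @ X.T + lam * np.eye(X.shape[0])
    # solve(G, X) gives G^{-1} X; the einsum takes the column-wise dot products.
    return np.einsum("ij,ij->j", X, np.linalg.solve(G, X))

# Toy data: two points are reconstructed exactly, the third is an outlier.
W = np.eye(2)
H = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 3.0]])
X = np.array([[1.0, 2.0, 13.0],
              [1.0, 2.0, 3.0]])

print(frobenius_loss(X, W, H))           # 100.0: the outlier's error of 10 is squared
print(l21_loss(X, W, H))                 # 10.0: the same error enters linearly
print(truncated_l21_loss(X, W, H, 1))    # 0.0: the largest error is discarded
scores = ridge_leverage_scores(X, lam=1.0)
print(scores)                            # the outlier column has the largest score
```

On this toy input the squared loss is dominated by the single outlier, the L2,1 loss grows only linearly with it, and the truncated loss ignores it entirely once `n_discard` covers the number of outliers, which is the adaptation to the noise proportion that the abstract describes.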
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.41

[References]