稀疏优化在机器学习中的若干应用

发布时间：2018-05-09 11:57

本文选题：非光滑优化 + 稀疏优化　；参考：《大连理工大学》2013年博士论文

【摘要】：近年来,利用解的稀疏性和其他内在结构成为众多计算和工程领域中共同关注的问题.稀疏的内含不仅是指“只有很少的非零分量”,它蕴含着“具有一种简单结构”.本文对机器学习中不同问题的稀疏结构进行建模,并在必要时改进经典的稀疏优化算法进行求解.论文的主要工作可概括如下： 1.第2章给出了本文在解决不同的机器学习问题中所提出的稀疏优化模型及算法.所提出的稀疏优化模型有同样的抽象结构,即在一个具有某种简单或特定结构的假设空间上极小化某个损失泛函.本文中给出的盒子约束的Lasso模型及块PCA模型均具有这一结构.该章给出了求解盒子约束的Lasso模型的同伦算法及求解块PCA模型的Splitting算法. 2.第3章研究了求解盒子约束的Lasso模型的同伦算法的收敛性并检验了该算法的数值性能.该章的工作指出同伦算法收敛性不是显然成立.在无退化指标假设和其它较弱的条件下,该章证明了同伦算法具有有限终止性.另外,该章讨论了退化和循环的问题.当前已有众多算法可求解该模型,但数值实验证明同伦算法具有特别的优势：适于最优解非常稀疏的问题及需要计算整条正则化路径的情形.这是第4章协同过滤数据可预测性问题的计算中所采用的关键技术. 3.第4章研究了协同过滤问题中评分数据的可预测性问题.当前协同过滤方面的大部分工作主要研究算法性能的改进.该章指出,受评分数据自身的限制,评分矩阵中有一部分未知评分是难于给出准确预测的.第4章提出了一个新的度量——相关性,以度量用户在某个商品上的评分能被准确预测的可能性.一个用户一商品对的相关性由相关的用户和商品构成的社区所确定.作为相关性度量的应用,提出了基于数据的组合方法(DOC)以应用于推荐系统. 4.第5章研究从时间序列基因表达数据中推断基因正则化网络(GRN).由于计算复杂度较大,大部分GRN重建方法仅限于推断较低连通性的单个网络.该章提出了网络和社区识别方法,结合社区结构信息,从基因表达数据中推断多个子网络.其中的块PCA模型,通过第2章给出的Splitting算法,可有效求解网络中的社区结构. 5.第6章研究了作为蛋白质鉴别关键步骤的肽段识别问题.序列数据库搜索是当前肽段识别的主流方法.但搜索引擎给出的大量的匹配是不正确的.现有方法大多基于半监督或监督学习框架,充分利用了诱骗PSM的样本及标签信息,但目标PSM样本点自身信息没有被充分利用.该章提出了一个称为FC-Ranker的新的评分方法,给每个目标PSM赋予一个非负权重,反映其匹配正确的可能性.特别地,FC-Ranker通过模糊支持向量机分类模型和所提出的模糊Silhouette指标迭代更新该权重.FC-Ranker在ROC指标、相同FDR水平下鉴别目标PSM的数目等方面的性能表现超过了主流后验数据库搜索方法.
[Abstract]:In recent years, the use of sparse solutions and other internal structures has become a common concern in many fields of computation and engineering. Sparse inclusion means not only that there are few non-zero components, but also that there is a simple structure. In this paper, the sparse structure of different problems in machine learning is modeled, and the classical sparse optimization algorithm is improved to solve the problem when necessary. The main work of the thesis can be summarized as follows: 1. In chapter 2, the sparse optimization model and algorithm for solving different machine learning problems are given. The proposed sparse optimization model has the same abstract structure, that is, minimizing a loss functional in a hypothetical space with some simple or specific structure. Both the box constrained Lasso model and the block PCA model presented in this paper have this structure. In this chapter, the homotopy algorithm for solving box constrained Lasso model and the Splitting algorithm for solving block PCA model are given. 2. In chapter 3, the convergence of homotopy algorithm for Lasso model with box constraints is studied and its numerical performance is tested. The work of this chapter points out that the convergence of homotopy algorithm is not obvious. Under the assumption of no degenerate index and other weaker conditions, the homotopy algorithm is proved to be finitely terminated in this chapter. In addition, the problem of degradation and cycle is discussed in this chapter. At present, there are many algorithms to solve the model, but the numerical experiments show that the homotopy algorithm has a special advantage: it is suitable for the problem where the optimal solution is very sparse and the entire regularization path needs to be calculated. This is the key technique used in the calculation of the predictability of cooperative filtering data in Chapter 4. 3. Chapter 4 studies the predictability of scoring data in collaborative filtering. Most of the current collaborative filtering efforts are focused on improving the performance of the algorithm. It is pointed out in this chapter that due to the limitation of the scoring data, it is difficult to predict accurately some unknown scores in the scoring matrix. In Chapter 4, a new metric, correlation, is proposed to measure the probability that a user's score on a product can be accurately predicted. The relevance of a user-to-commodity pair is determined by the associated user and the community of goods. As an application of correlation measurement, a data-based combination method (DOC) is proposed to be used in recommendation systems. 4. Chapter 5 studies the inference of gene regularization network from time series gene expression data. Because of the high computational complexity, most of the GRN reconstruction methods are limited to inferring a single network with low connectivity. In this chapter, network and community identification methods are proposed, and multiple subnetworks are inferred from gene expression data combined with community structure information. The block PCA model can effectively solve the community structure in the network through the Splitting algorithm given in Chapter 2. 5. In chapter 6, the recognition of peptide segment as the key step of protein identification is studied. Sequence database search is the main method of peptide recognition. But the search engine gives a large number of matches is incorrect. Most of the existing methods are based on semi-supervised or supervised learning framework, which make full use of the sample and label information of decoy PSM, but the information of target PSM sample points is not fully utilized. In this chapter, a new scoring method called FC-Ranker is proposed, which assigns a non-negative weight to each target PSM, reflecting the possibility of a correct match. In particular, FC-Ranker iteratively updates the weight through the fuzzy support vector machine classification model and the proposed fuzzy Silhouette index. The performance of FC-Ranker in terms of ROC index and the number of target PSM at the same FDR level is better than that of the mainstream posteriori database search method.
【学位授予单位】：大连理工大学
【学位级别】：博士
【学位授予年份】：2013
【分类号】：TP181

【参考文献】