Research on a Double-Weighted Random Forest Prediction Algorithm and Its Parallelization

Published: 2018-12-21 20:03
[Abstract]: With the advance of technology, the era of big data has arrived, and data volume is growing explosively. Big data poses major challenges to traditional machine learning methods, and the random forest algorithm has attracted wide attention for its strong performance. Because big data is massive, complex and varied, and fast-changing, it raises two problems. First, machine learning algorithms run for a long time and cannot deliver results within an acceptable time. Second, the data are high-dimensional and highly redundant, so the traditional random forest regression algorithm cannot achieve satisfactory results. To address these problems, this thesis studies improvements to traditional random forest regression and its parallelization.
For the problem of high dimensionality and redundancy, prior work has proposed replacing the random feature sampling of the traditional random forest with weighted feature sampling. Our analysis finds, however, that most related studies target classification and rarely discuss regression, and that many methods designed for classification cannot be applied directly to regression. Moreover, nearly all feature-weight measures assume by default that features are independent of one another, which often does not hold in practice. This thesis therefore adopts, for regression, a feature-weight measure that takes the relationships among features into account, and uses two methods for feature sampling. Further analysis shows that replacing random feature sampling with weighted feature sampling improves the accuracy of the individual classification and regression tree (CART) models, but also increases the correlation among trees and reduces their diversity, which may in turn degrade the overall performance of the random forest regression algorithm.
To address these issues, this thesis proposes a double-weighted random forest regression algorithm. In addition to weighting features to improve the accuracy of the CART models, it also weights the generated tree models, aiming to balance tree accuracy and tree diversity through the two sets of weights and thereby improve the final predictive performance of random forest regression. To weight the tree models, this thesis proposes two new model-weight calculation methods that account for both the accuracy of the trees and the diversity among them: a forward search with replacement and a diversity-based calculation. The two model-weight methods are combined pairwise with the two feature sampling methods to form four double-weighted random forest regression algorithms, whose effectiveness is analyzed experimentally. For the problem of long running times in big data settings, this thesis designs and implements a parallel version of the double-weighted random forest regression algorithm and analyzes the parallelization effect experimentally.
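The two kinds of weights described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the feature-weight measure, the two sampling methods, or the exact model-weight formulas. The sketch assumes the feature weights are already computed, uses the Efraimidis-Spirakis key method as one possible way to sample features in proportion to weight, and reads the tree-weighting search as greedy ensemble selection in which the same tree may be chosen repeatedly. All function names are hypothetical.

```python
import random


def weighted_feature_sample(weights, k, rng):
    """Draw up to k distinct feature indices, favouring high-weight features.

    Assumption: `weights` come from some external feature-weight measure.
    Uses Efraimidis-Spirakis keys u**(1/w); zero-weight features are skipped.
    """
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights) if w > 0]
    keys.sort(reverse=True)
    return [i for _, i in keys[:k]]


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def forward_select_with_replacement(tree_preds, y_val, rounds):
    """Greedy forward selection of trees, allowing repeats.

    tree_preds[j] holds tree j's predictions on a validation set.
    Each round adds the tree that most reduces the validation MSE of the
    averaged ensemble; a tree's weight is its share of the selections.
    """
    counts = [0] * len(tree_preds)
    running_sum = [0.0] * len(y_val)
    for step in range(1, rounds + 1):
        best_j, best_err = None, float("inf")
        for j, preds in enumerate(tree_preds):
            candidate = [(s + p) / step for s, p in zip(running_sum, preds)]
            err = mse(candidate, y_val)
            if err < best_err:  # ties keep the earliest tree
                best_j, best_err = j, err
        counts[best_j] += 1
        running_sum = [s + p for s, p in zip(running_sum, tree_preds[best_j])]
    return [c / rounds for c in counts]


def weighted_predict(tree_preds, tree_weights):
    """Combine per-tree predictions with the given tree weights."""
    n = len(tree_preds[0])
    return [sum(w * preds[i] for w, preds in zip(tree_weights, tree_preds))
            for i in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical feature weights; feature 1 has weight 0 and is never drawn.
    print(weighted_feature_sample([0.5, 0.0, 0.3, 0.2], k=2, rng=rng))

    y_val = [1.0, 2.0, 3.0]
    tree_preds = [
        [2.0, 3.0, 4.0],  # overestimates by 1
        [0.0, 1.0, 2.0],  # underestimates by 1
        [4.0, 5.0, 6.0],  # overestimates by 3
    ]
    w = forward_select_with_replacement(tree_preds, y_val, rounds=2)
    print(w, weighted_predict(tree_preds, w))
```

In the toy data, the first two trees are equally accurate but err in opposite directions, so the search selects one of each and their errors cancel in the average. A weight based on accuracy alone would not distinguish this pair from two trees that make the same error, which is the accuracy-versus-diversity trade-off the abstract refers to.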
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP18



