Research on a Distributed Improved Random Forest Classification Model for Student Employment Data Based on MapReduce
Published: 2018-04-08 07:27
Topic: machine learning; Focus: data classification model; Source: Systems Engineering - Theory & Practice (《系统工程理论与实践》), 2017, Issue 05
[Abstract]: Educational data mining (EDM) is a frontier research area in contemporary educational informatization and is attracting growing attention from educators and data scientists. In the era of big data, as the scale of data to be processed keeps surging, existing data mining models hit a computing bottleneck on a single processing node, and various distributed computing frameworks for big data processing have emerged in response. With these frameworks, machine learning models aimed at university employment data mining can meet future demands for large-scale data processing and can support data mining and decision making in future information integration systems with very large data sets. Against this background, this study compares the classification performance of existing data models on the research target and proposes an improved random forest model. The improved model uses the information gain of a feature, computed with an introduced input feature weighting coefficient, as the criterion for choosing the optimal feature split, in order to improve classification performance. Simulation tests measure how much the improved model raises classification performance over existing models. At the same time, to address the single-node performance bottleneck in classifying massive data in the big data era, the study proposes a large-scale student employment data classification and prediction model based on a distributed improved random forest algorithm. The MapReduce distributed computing framework is used to implement serialized writing and deserialized loading of the trained model between the local disk and the distributed file system, thereby achieving a distributed extension of the large-scale data classification model based on the improved random forest model.
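The weighted split criterion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes the simplest reading: each feature's information gain is multiplied by a per-feature weighting coefficient, and the feature with the highest weighted gain is chosen for the split. The function names, feature names, and weight values are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(rows, labels, feature):
    """Information gain from splitting on one categorical feature."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def best_weighted_split(rows, labels, weights):
    """Pick the feature with the highest weight * information gain.

    ASSUMPTION: multiplying the gain by the weight is one plausible way
    to apply a feature weighting coefficient; the paper may differ.
    """
    scores = {
        feature: weight * information_gain(rows, labels, feature)
        for feature, weight in weights.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: two categorical features, binary employment label.
rows = [
    {"major": "cs", "internship": "yes"},
    {"major": "cs", "internship": "no"},
    {"major": "art", "internship": "yes"},
    {"major": "art", "internship": "no"},
]
labels = ["employed", "unemployed", "employed", "unemployed"]
weights = {"major": 1.0, "internship": 0.5}  # hypothetical coefficients

feature, scores = best_weighted_split(rows, labels, weights)
print(feature, scores)
```

In a random forest, a criterion of this kind would be evaluated at every node over a random subset of features; the sketch only shows the scoring step for a single node.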
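[Affiliation]: CIMS Center, College of Electronics and Information Engineering, Tongji University
[Funding]: National Natural Science Foundation of China (71690234)
[CLC number]: G647.38; TP311.13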
Article ID: 1720601
Article link: https://www.wllwen.com/jiaoyulunwen/gaodengjiaoyulunwen/1720601.html