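Research on Machine-Learning-Based Prediction and Analysis of Graduate Income
Published: 2018-04-05 18:37
Topic: educational information  Focus: data mining  Source: Jilin University, master's thesis, 2017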
[Abstract]: With the arrival of the information age, information technology keeps reshaping the economy, society, culture and everyday life, and education has been deeply affected by the same changes. Educational information databases are growing ever larger, and the field urgently needs efficient techniques to process, analyze and apply this large-scale data, and to extract from it information that is useful to education practitioners at different levels. Against this background, this thesis uses machine learning algorithms to analyze in depth the data set used by Score Card, a US college recommendation website, and builds regression and classification models that take school characteristics as input and the average income of a school's graduates as output. With these models, the average income of a school's graduates can be reasonably predicted from the school's characteristics, which can support both the effective allocation of funds such as education-department grants and the founding of private schools. The main work of this thesis is as follows:
1. A univariate linear regression model is fitted between each school-level feature and the target value, the effect of each individual feature on graduates' average income is analyzed, and its meaning is interpreted. The multivariate regression model and the KNN regression model are then compared on how well they predict graduates' average income.
2. A KNN polynomial regression algorithm that incorporates KNN regression is proposed. It performs better on the validation set than both multivariate regression and KNN, but its training time is relatively long. Fortunately, predicting graduates' average income is not a problem whose data items change frequently, so even though the algorithm's time complexity is the sum of the time complexities of the two base algorithms, its advantage on the regression problem remains clear.
3. Graduates' average income is classified with four methods: logistic regression, decision tree, KNN and Adaboost. Of the four, Adaboost has the highest classification accuracy and KNN the lowest, performing even worse than random prediction. Logistic regression also produced the special case of a 100% recall rate.
4. A recall-based logistic regression algorithm is proposed. If a trained logistic regression model shows an excessively high recall or precision on the validation and training sets, the training set can be partitioned according to the metric that is too high, and a model is trained on each resulting subset. The originally single-layer model thus becomes a two-layer model, whose actual accuracy must be checked on the validation set. The model can keep recursing in this way until its accuracy on the validation set starts to fall as its depth increases.
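The regression comparison in the first contribution can be sketched roughly as follows. This is a minimal illustration only, not the thesis's code: the data are synthetic, the feature names `admission_rate` and `tuition` and the income formula are hypothetical stand-ins for school-level features, and the min-max scaling and `k=5` are assumptions made here so that the KNN distances are meaningful.

```python
import math
import random


def fit_univariate(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def knn_predict(train_x, train_y, query, k=3):
    """KNN regression: average the targets of the k nearest training rows."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row, query))
    nearest = sorted(range(len(train_x)), key=lambda i: sq_dist(train_x[i]))[:k]
    return sum(train_y[i] for i in nearest) / k


def rmse(pred, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))


def demo(seed=0, n=200):
    # Synthetic schools: (admission_rate, tuition, graduate income). Hypothetical.
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        admission_rate = rng.uniform(0.1, 1.0)
        tuition = rng.uniform(5_000, 50_000)
        income = (60_000 - 30_000 * admission_rate + 0.8 * tuition
                  + rng.gauss(0, 2_000))
        rows.append((admission_rate, tuition, income))
    train, valid = rows[:150], rows[150:]
    valid_y = [r[2] for r in valid]

    # One univariate model per feature; the slope sign shows the direction of
    # that feature's association with income.
    slopes = {}
    for j, name in enumerate(["admission_rate", "tuition"]):
        intercept, slope = fit_univariate([r[j] for r in train],
                                          [r[2] for r in train])
        slopes[name] = slope
        preds = [intercept + slope * r[j] for r in valid]
        print(f"univariate {name}: slope={slope:.2f}, "
              f"validation RMSE={rmse(preds, valid_y):.0f}")

    # KNN regression on min-max scaled features (scaling fitted on train only).
    lo = [min(r[j] for r in train) for j in range(2)]
    hi = [max(r[j] for r in train) for j in range(2)]

    def scale(r):
        return [(r[j] - lo[j]) / (hi[j] - lo[j]) for j in range(2)]

    train_x = [scale(r) for r in train]
    train_y = [r[2] for r in train]
    knn_rmse = rmse([knn_predict(train_x, train_y, scale(r), k=5) for r in valid],
                    valid_y)
    print(f"KNN (k=5): validation RMSE={knn_rmse:.0f}")
    return slopes, knn_rmse


if __name__ == "__main__":
    demo()
```

Comparing the printed validation errors mirrors the kind of model comparison the abstract describes. The numbers themselves say nothing about the thesis's actual results.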
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP181
Article ID: 1715970
Article link: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/1715970.html