
Improvement and Application of a Naive Bayes Algorithm Based on Attribute Selection and Weighting

Published: 2018-04-26 18:18

  Topics: data mining + naive Bayes; Source: Xi'an University of Technology, 2017 master's thesis


[Abstract]: With the spread of information technology and the arrival of the big data era, demand for in-depth data analysis keeps growing, and data mining is an effective tool for turning information into knowledge. Naive Bayes is one of the ten classic data mining algorithms selected by an authoritative international data mining conference. The model originates in classical probability theory, has a solid mathematical foundation, and delivers stable classification efficiency. It also needs few estimated parameters, is relatively insensitive to missing data, and is simple to implement. In theory, naive Bayes has the lowest error rate compared with other classification algorithms. That result, however, rests on the assumption that attributes are mutually independent, which often does not hold in practice, so performance degrades when there are many attributes or when the attributes are strongly correlated.

This thesis addresses that weakness by improving naive Bayes in two respects: attribute selection and attribute weighting. For attribute selection, the information value (IV) index is first used to obtain a first subset of attributes that are highly relevant to the class. Redundant attributes are then filtered out of that subset to obtain a second subset, and a naive Bayes classifier is built on each of the two subsets. The analysis finds that the classifier built after these two rounds of selection on the initial attribute set both reduces dimensionality and improves classification accuracy. For attribute weighting, the analytic hierarchy process (AHP) is used to quantify empirical knowledge and adjust the weights learned from the training samples, yielding more comprehensive weights.
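
The first selection pass, ranking attributes by information value, can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes the weight-of-evidence definition of IV common in credit scoring, a binary class label, and categorical attributes. The function names, the `smoothing` term, and the `threshold` of 0.02 are hypothetical choices made for the illustration, not values taken from the thesis, and the second pass (redundancy filtering) is not shown.

```python
import math
from collections import Counter


def information_value(values, labels, positive=1, smoothing=0.5):
    """IV of one categorical attribute against a binary class label.

    Assumed definition: IV = sum over categories of (p - q) * ln(p / q),
    where p and q are the category's shares of positive and negative rows.
    """
    pos = Counter(v for v, y in zip(values, labels) if y == positive)
    neg = Counter(v for v, y in zip(values, labels) if y != positive)
    categories = set(pos) | set(neg)
    # Smoothed totals keep every share strictly positive, so log() is defined.
    pos_total = sum(pos.values()) + smoothing * len(categories)
    neg_total = sum(neg.values()) + smoothing * len(categories)
    iv = 0.0
    for c in categories:
        p = (pos[c] + smoothing) / pos_total
        q = (neg[c] + smoothing) / neg_total
        iv += (p - q) * math.log(p / q)  # (p - q) times the weight of evidence
    return iv


def select_by_iv(rows, labels, threshold=0.02):
    """Return (indices of attributes with IV >= threshold, list of all IVs)."""
    n_attrs = len(rows[0])
    ivs = [information_value([r[j] for r in rows], labels) for j in range(n_attrs)]
    selected = [j for j, iv in enumerate(ivs) if iv >= threshold]
    return selected, ivs


# Toy data: attribute 0 determines the class, attribute 1 is unrelated to it.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")] * 5
labels = [1, 1, 0, 0] * 5
selected, ivs = select_by_iv(rows, labels)
print(selected)  # -> [0]
```

On this toy data the class-determining attribute gets a large IV and the unrelated one gets an IV of zero, so only the first survives the filter.
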
The posterior probabilities in the naive Bayes classification formula are then weighted according to the importance of each attribute value, which improves classification accuracy. Combining the strengths of attribute selection and attribute weighting, the thesis proposes a selection-weighted naive Bayes algorithm: it first applies the two-pass attribute selection based on the information value index to the initial attribute set, then computes weights with the analytic hierarchy process (AHP), and builds a weighted naive Bayes classifier on the optimal attribute subset. The algorithm is validated experimentally on general-purpose datasets. Finally, the improved algorithm is applied to a model for identifying spam SMS users in the telecom industry, and experiments on the Spark platform demonstrate its effectiveness, which in turn improves the results of spam governance and refines its techniques.
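
The weighting step can be sketched in the same spirit. This is an illustration under stated assumptions, not the thesis's formula: AHP priorities are approximated with the row geometric-mean method (exact only for a perfectly consistent comparison matrix), and the classifier uses the common exponent-weighted form, score(c) = log P(c) + Σ_j w_j · log P(x_j | c), with Laplace smoothing. The abstract does not say how AHP weights are merged with the weights learned from training samples, so the sketch uses the AHP weights alone, rescaled to average 1. All names here (`ahp_weights`, `WeightedNaiveBayes`, `alpha`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ahp_weights(pairwise):
    """Approximate AHP priority weights via normalized row geometric means."""
    n = len(pairwise)
    gms = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(gms)
    return [g / total for g in gms]


class WeightedNaiveBayes:
    """Categorical naive Bayes with one exponent weight per attribute."""

    def __init__(self, weights, alpha=1.0):
        self.weights = weights  # w_j, one per attribute
        self.alpha = alpha      # Laplace smoothing constant

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.n = len(labels)
        # counts[j][c][v] = number of class-c rows with value v in attribute j
        self.counts = [defaultdict(Counter) for _ in self.weights]
        self.domains = [set() for _ in self.weights]
        for row, y in zip(rows, labels):
            for j, v in enumerate(row):
                self.counts[j][y][v] += 1
                self.domains[j].add(v)
        return self

    def log_scores(self, row):
        scores = {}
        for c, nc in self.class_counts.items():
            s = math.log(nc / self.n)  # log prior
            for j, v in enumerate(row):
                num = self.counts[j][c][v] + self.alpha
                den = nc + self.alpha * len(self.domains[j])
                s += self.weights[j] * math.log(num / den)  # weighted log-likelihood
            scores[c] = s
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Hypothetical expert judgment: attribute 0 is 3 times as important as attribute 1.
w = ahp_weights([[1, 3], [1 / 3, 1]])
weights = [len(w) * x for x in w]  # rescale so the weights average 1

rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")] * 5
labels = [1, 1, 0, 0] * 5
clf = WeightedNaiveBayes(weights).fit(rows, labels)
print(clf.predict(("a", "x")))  # -> 1
```

Setting every weight to 1 recovers ordinary naive Bayes, which makes it easy to compare the weighted and unweighted classifiers on the same attribute subset.
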
[Degree-granting institution]: Xi'an University of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP18



