Improvement and Application of the Naive Bayes Algorithm Based on Attribute Selection and Weighting

Published: 2018-04-26 18:18

Topic: data mining + naive Bayes; Source: Xi'an University of Technology, 2017 master's thesis

[Abstract]: With the spread of information technology and the arrival of the big data era, demand for in-depth data analysis keeps growing, and data mining is an effective tool for turning information into knowledge. The naive Bayes algorithm was named one of the ten classic algorithms in data mining by an authoritative international data mining conference. The naive Bayes model originates in classical probability theory and has a solid mathematical foundation and stable classification efficiency. It also requires few estimated parameters, is not very sensitive to missing data, and is relatively simple. In theory, the naive Bayes model has the lowest error rate among classification algorithms. In practice, however, its assumption that attributes are mutually independent often does not hold, so model performance degrades when there are many attributes or when the attributes are strongly correlated. This thesis addresses that weakness by improving the algorithm in two respects: attribute selection and attribute weighting. For attribute selection, an information value (IV) index is first introduced to obtain a first attribute subset that is highly relevant to the class; redundant attributes are then filtered out of it to obtain a second attribute subset, and a naive Bayes classification model is built on each of the two subsets. The analysis finds that the model built after two rounds of attribute selection on the initial attribute set both reduces attribute dimensionality and improves classification accuracy. For attribute weighting, the analytic hierarchy process (AHP) is used to quantify empirical knowledge and adjust the weights learned from the training samples, yielding more comprehensive weights; the posterior probability in the naive Bayes classification formula is then weighted according to the importance of each attribute value, which improves classification accuracy. The two improvements are then combined into a selection-weighted naive Bayes algorithm: it performs two rounds of attribute selection on the initial attribute set using the information value index, computes weights with AHP, and builds a weighted naive Bayes classifier on the optimal attribute subset. The algorithm is validated experimentally on general-purpose datasets. Finally, the improved naive Bayes algorithm is applied to a model for identifying spam SMS users in the telecom industry, and experiments on the Spark platform demonstrate its effectiveness, which in turn improves the results of spam governance work and the techniques used for it.
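The abstract does not give the thesis's formulas, so the following Python sketch only illustrates the general idea of the pipeline under stated assumptions. It uses the common definition of information value for a binary label, approximates AHP priority weights by normalised row geometric means, and applies the weights as exponents on the class-conditional probabilities, which is one widely used form of weighted naive Bayes and may differ from the weighting used in the thesis. The redundancy-filtering step and the combination of AHP weights with sample-trained weights are omitted. All function names, the smoothing constants, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def information_value(values, labels, positive=1, eps=0.5):
    """Information value of one categorical attribute against a binary label."""
    pos = Counter(v for v, y in zip(values, labels) if y == positive)
    neg = Counter(v for v, y in zip(values, labels) if y != positive)
    levels = set(pos) | set(neg)
    # eps is an additive smoothing term (an assumption) that avoids log(0).
    n_pos = sum(pos.values()) + eps * len(levels)
    n_neg = sum(neg.values()) + eps * len(levels)
    iv = 0.0
    for v in levels:
        p = (pos[v] + eps) / n_pos  # share of positives taking value v
        q = (neg[v] + eps) / n_neg  # share of negatives taking value v
        iv += (p - q) * math.log(p / q)
    return iv


def ahp_weights(pairwise):
    """Approximate AHP priorities from a pairwise comparison matrix
    using normalised row geometric means."""
    n = len(pairwise)
    gm = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(gm)
    return [g / total for g in gm]


def train_weighted_nb(rows, labels, alpha=1.0):
    """Fit categorical naive Bayes with Laplace smoothing.
    Returns predict(row, weights), which scores each class c as
    log P(c) + sum_j weights[j] * log P(x_j | c)."""
    class_count = Counter(labels)
    cond = defaultdict(Counter)  # (attribute index, class) -> value counts
    levels = [set() for _ in rows[0]]
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(j, y)][v] += 1
            levels[j].add(v)

    def predict(row, weights):
        best, best_score = None, -math.inf
        for c, n_c in class_count.items():
            score = math.log(n_c / len(labels))
            for j, v in enumerate(row):
                p = (cond[(j, c)][v] + alpha) / (n_c + alpha * len(levels[j]))
                score += weights[j] * math.log(p)
            if score > best_score:
                best, best_score = c, score
        return best

    return predict


# Toy data: attribute 0 separates the classes, attribute 1 is only weakly related.
rows = [("yes", "a"), ("yes", "b"), ("yes", "a"),
        ("no", "b"), ("no", "a"), ("no", "b")]
labels = [1, 1, 1, 0, 0, 0]

ivs = [information_value([r[j] for r in rows], labels) for j in range(2)]
# Hypothetical expert judgement: attribute 0 is three times as important as attribute 1.
weights = ahp_weights([[1.0, 3.0], [1.0 / 3.0, 1.0]])
predict = train_weighted_nb(rows, labels)
prediction = predict(("yes", "b"), weights)
```

In a selection step, attributes whose information value falls below a chosen threshold would be dropped before training; the threshold is a tuning choice that the abstract does not specify.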
[Degree-granting institution]: Xi'an University of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP18

[References]