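A Study on the Collapsibility of High-Dimensional Contingency Tables in Categorical Data
Published: 2018-05-30 04:21
Topics: contingency table collapsing + Simpson's paradox; Source: Xiamen University, master's thesis, 2014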
[Abstract]: Statistical methods for categorical data are important tools for analyzing nominal and ordinal data, and contingency tables are a common, intuitive way to analyze such data. For example, medical researchers build contingency tables by classifying cases by age and sex; education researchers classify students by age, sex, and family background; economists classify business success or failure by industry, region, and initial investment; and market researchers classify consumers by age, sex, and propensity to consume a product.

Traditional categorical data analysis mainly tests independence in contingency tables. With the introduction and wide adoption of log-linear models, categorical methods are now often applied to high-dimensional contingency tables, yet the literature in China and abroad lacks detailed methods for analyzing them. Because the data in such tables are complex, analyzing the associations among the variables requires reducing the dimension of the table, that is, collapsing the table over some of its variables. Improper collapsing, however, can produce three phenomena: Simpson's paradox, spurious association, and spurious independence, all of which make the table harder to analyze. Methods for studying collapsibility are therefore important. Three-way tables have received some attention from researchers in China and abroad, but the collapsibility of high-dimensional tables remains little studied.

This thesis analyzes collapsibility from three perspectives: interaction, mutual information, and information entropy. It examines the conditions under which high-dimensional contingency tables are collapsible and how collapsing can be carried out. The main findings are:

1. A three-way table is collapsible whenever conditional independence holds among its variables. For a four-way table, conditional independence among the variables does not guarantee collapsibility.
2. Equivalence conditions exist between the interaction-based log-linear model and the mutual-information-based linear information model, so the results of the two analyses can inform each other.
3. Model selection methods are given for linear information models with and without a designated conditioning variable. The fitted linear information model is more parsimonious than the log-linear model: the interaction-based model indicates that the table is not collapsible, whereas the mutual-information-based model indicates that it is.
4. Methods for collapsing over variables are given based on mutual information and on information entropy. The mutual-information method collapses the table with respect to the associations among variables and allows some non-significant association information to be lost. The entropy method collapses the table according to how much uncertainty each variable carries and allows no information about the variables to be lost.
5. Two methods for ranking the importance of the variables in a contingency table are given, one based on mutual information and one on information entropy. With respect to collapsibility, the mutual-information ranking is more accurate; with respect to how much uncertainty each variable carries, the entropy ranking is more accurate.

These results contribute to the further development of categorical data analysis and offer several practical approaches to collapsing high-dimensional contingency tables.
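The collapsing hazard and the information-theoretic quantities named in the abstract can be illustrated with a small sketch. It is not the thesis's method: the counts are illustrative (patterned on the widely cited kidney-stone treatment example, not drawn from the thesis), and the helper names `collapse`, `odds_ratio`, `entropy`, and `cond_mutual_info` are hypothetical. It collapses a 2x2x2 table over one variable, compares odds ratios before and after, and computes mutual information from entropies.

```python
import math
from collections import defaultdict

# Illustrative counts only (not from the thesis):
# (treatment, stratum, outcome) -> number of cases.
counts = {
    ("A", "small", "success"): 81,  ("A", "small", "failure"): 6,
    ("A", "large", "success"): 192, ("A", "large", "failure"): 71,
    ("B", "small", "success"): 234, ("B", "small", "failure"): 36,
    ("B", "large", "success"): 55,  ("B", "large", "failure"): 25,
}

def collapse(table, keep):
    """Sum out every axis whose position is not listed in `keep`."""
    out = defaultdict(int)
    for cell, n in table.items():
        out[tuple(cell[i] for i in keep)] += n
    return dict(out)

def odds_ratio(t):
    """Odds ratio of success for treatment A versus B in a 2x2 table."""
    return (t[("A", "success")] * t[("B", "failure")]) / (
        t[("A", "failure")] * t[("B", "success")])

def entropy(table, axes):
    """Shannon entropy (in nats) of the marginal distribution over `axes`."""
    marg = collapse(table, axes)
    total = sum(marg.values())
    return -sum(n / total * math.log(n / total) for n in marg.values() if n > 0)

def cond_mutual_info(table, x, y, z=()):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z); z=() gives I(X;Y)."""
    return (entropy(table, (x,) + z) + entropy(table, (y,) + z)
            - entropy(table, (x, y) + z) - entropy(table, z))

# Stratum-specific 2x2 tables (treatment x outcome) and the collapsed table.
by_stratum = {
    s: {(t, o): n for (t, st, o), n in counts.items() if st == s}
    for s in ("small", "large")
}
marginal = collapse(counts, (0, 2))  # collapse over the stratum axis

or_small = odds_ratio(by_stratum["small"])
or_large = odds_ratio(by_stratum["large"])
or_marginal = odds_ratio(marginal)

mi_marginal = cond_mutual_info(counts, 0, 2)           # I(treatment; outcome)
mi_conditional = cond_mutual_info(counts, 0, 2, (1,))  # I(treatment; outcome | stratum)

print(f"odds ratio, small stratum: {or_small:.3f}")
print(f"odds ratio, large stratum: {or_large:.3f}")
print(f"odds ratio, collapsed:     {or_marginal:.3f}")
print(f"I(T;O) = {mi_marginal:.5f} nats, I(T;O|S) = {mi_conditional:.5f} nats")
```

With these counts, both stratum-specific odds ratios exceed 1 while the collapsed odds ratio falls below 1, so collapsing over the stratum variable reverses the direction of the association. Writing mutual information as a sum and difference of entropies keeps the sketch short; a real analysis would also need a significance test before treating a small value as zero.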
[Degree-granting institution]: Xiamen University
[Degree level]: Master's
[Year conferred]: 2014
[Classification number]: C81
Article ID: 1953902
Article link: https://www.wllwen.com/shekelunwen/shgj/1953902.html