Testing and Application of Partial Models in Stock Market Data Mining

Published: 2018-01-15 20:06

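Keywords: Testing and Application of Partial Models in Stock Market Data Mining. Source: Southwestern University of Finance and Economics (西南财经大学), 2014 master's thesis. Document type: degree thesis.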

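[Abstract, translated from Chinese]: China's stock market has been through 24 turbulent years. Although it started late, it has kept exploring its own path and, through repeated self-correction, has been returning to rationality. Over these 24 years, faced with a market that has not yet reached weak-form efficiency, experts and scholars have done a great deal of research on the market's characteristics and on stock prediction. Current research falls into two main schools: fundamental analysis and technical analysis. Fundamental analysis holds that a stock's price reflects the company's intrinsic value and focuses on choosing the variables to analyze. Technical analysis treats historical opening, closing, highest, and lowest prices as the raw material for predicting future prices and focuses on choosing data-processing and model-building methods. The two schools rest on different frameworks, and each has its strengths. Either way, it is an accepted fact that China's stock market is not weak-form efficient: price series are correlated with their own history, which gives technical analysis a footing. This thesis belongs to technical analysis.
Existing technical-analysis methods include time series analysis, fuzzy mathematics, chaos theory, and data mining. Data mining emerged in recent years as data volumes grew geometrically. It is a technique for handling large amounts of data that does not prescribe in advance the form of the information to be discovered, and instead lets the data speak for themselves. Popular data mining techniques include decision trees, neural networks, support vector machines, and cluster analysis, and each can be implemented by several algorithms. Data mining is well suited to large, heterogeneous stock data. Scholars who apply it to the stock market have mainly worked on four fronts: designing and improving the mining algorithms, selecting and processing market variables, deciding how stock data are used, and combining different mining models. This thesis also takes data mining as its starting point, but tries to explore and improve the technique from a new angle by proposing the concept of a data mining partial model.
The concept grew out of the model structure specific to classification trees. A classification tree outputs a tree with many leaves; each leaf represents one piece of knowledge (one rule), so there are as many rules as leaves. In practice these rules differ in value: some leaves hold up repeatedly and predict with high accuracy, while others are nearly useless and predict with very low accuracy. If each leaf is treated as a sub-model, prediction accuracy can be computed per sub-model rather than for the model as a whole. Building a partial model means finding the sub-models with high accuracy and discarding the rest. In the stock market, buy and sell points worth acting on are limited. Successful investors are not those who trade every day, but those who wait for the right moment and act only when the signal is clearest and most reliable.
This thesis builds a decision tree partial model from basic data on the Shanghai Composite Index. Because the theory of candlestick (K-line) charts is relatively mature, and to make the model output easy to compare with existing theory, the four basic daily indicators (opening, closing, highest, and lowest price) are converted into four indicators: upper shadow length, lower shadow length, body length, and body color. These four are the input variables, and whether the price rises or falls 10 days later is the output variable. A decision tree is built in R (version 3.0.2) and then filtered. Collecting the 7 leaves with the highest fitting accuracy shows the following: if a harami pattern and a double-needle bottom occur together, the price rises; if only a double-needle bottom occurs, the price also rises when the bottoming needle is long (>= 9.65), and the outlook is unclear when the needle is not pronounced; if only a harami pattern occurs, the outlook is unclear from the basic data alone. The harami pattern (孕线组合) and the double-needle bottom (双针探底) are established empirical summaries of what candlestick shapes mean, so this first exploration of the classification tree partial model broadly agrees with experience.
The decision tree partial model is a partial model on the output side: in essence, it accepts only part of the model's results rather than all of them. The thesis then extends the concept. Because actionable buy and sell points are limited, prediction is worthwhile only when the price signal is clear, whether upward or downward. Following this idea, the support vector machine (SVM) partial model aims to find the data environment in which the SVM predicts best. This is a partial model on the input side. Specifically, an SVM built indiscriminately on the training data predicts poorly. The SVM partial model instead builds model M1 on training set A, selects the records that M1 fits correctly as set B, and builds model M2 on B. A classification tree is then used to find and summarize what the records in B have in common, denoted K, and M2 is used to predict only those validation records that have characteristic K. In other words, only records with characteristic K qualify as input to M2.
Before building the SVM partial model, the thesis uses analysis of variance to show that SVM models built from different data inputs do differ significantly in goodness of fit. The 735 records from January 20, 2011 to February 18, 2014 are divided into groups of 50, giving 14 groups, and three comparative experiments are run on them. In the first experiment every record in each group is used for modeling; in the second only the first 30 records of each group are used; in the third only the first 20. Under the null hypothesis that the three input schemes yield models with no significant difference in fit, the p-value is approximately 0, so the null hypothesis is rejected: different data inputs from the same period can lead to very different goodness of fit.
Having given initial evidence for the practicality of the decision tree partial model and the soundness of the SVM partial model, the thesis uses both to look for investment rules in the stock market. Chapter 5 applies the decision tree partial model with the input variables "yesterday's body length, yesterday's body color, yesterday's lower shadow length, today's body length, today's body color, today's lower shadow length, DIF, DEA, DIF-DEA" and the output variable "rise or fall of the price 10 days later". It finds 9 leaves with fitting accuracy of 80% or above and applies their rules to the validation data; rules 32, 11, 132, and 266 all reach 100% prediction accuracy. Once consolidated, these rules amount to: if DIF-DEA < -1.85, the price is predicted to fall; if DIF-DEA > 11.05, the price is predicted to rise; if -1.85 < DIF-DEA < 11.05, the trend is unclear. The technical-analysis literature contains the summary "when DIF > 0 and DEA > 0, the price will rise if DIF > DEA and fall if DIF < DEA; when DIF < 0 and DEA < 0, the price will rise if DIF > DEA and fall if DIF < DEA". The conclusion of the decision tree partial model thus refines this summary with more precise numerical intervals. The thesis suggests that the stricter intervals (with -1.85 and 11.05 as boundaries instead of 0) may reflect investor psychology: when the market rebounds slightly, most investors stay on the sidelines and do not act, which leaves the outlook unclear. Only when the rebound is large enough do investors believe that spring has arrived and buy, and prices then rise. The reverse also holds.
To build the SVM partial model, a model is first fitted to the training data; the correctly fitted records are then gathered and modeled again, and their common patterns are identified and denoted G1, G2, G3, ... . Validation records that match G1, G2, G3, ... are then selected and predicted with the second model, and the prediction accuracy is computed. Following this procedure, 4 common patterns were found in the correctly fitted data; they mostly share some feature of three indicators: the previous day's lower shadow length, DIF, and DIF-DEA. Predicting the validation records that match each of these 4 patterns gives accuracies of 57.1%, 46.1%, 72.7%, and 75%. Their mean is clearly higher than the 55.5% accuracy obtained when a model is built directly on the unfiltered training data and checked against the validation data. This further indicates that there are data environments suited to SVM prediction, and that predicting only when such an environment arises works much better than predicting indiscriminately.
Classical statistics first specifies a set of variables consistent with economic theory, fixes the relationships among them in advance, and then runs regression analyses within that pre-built framework: a "theory first, data second" way of thinking. Data mining breaks with this convention. It imposes no prior theory about "what should be" and hands the say entirely to the data: a "data first, theory second" way of thinking. For this reason the thesis ventures to discuss the partial model concept without detailed mathematical derivation. It not only proposes the concept but also extends it: when data mining is used to process data, if any stage of model building (data input, data processing, or result output) is not adopted in full, the resulting model is called a data mining partial model. The classification tree partial model is partial with respect to the output, and the SVM partial model is partial with respect to the data input. Partial models with other meanings and from other angles may appear in the future, and the author believes that more and more scholars will join the discussion.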
[Abstract]: China's stock market has come through 24 eventful years. It started late and has stumbled along the way, yet it has kept exploring and, by repeatedly correcting itself, has moved back toward rationality. Because the market has not yet reached weak-form efficiency, experts and scholars have spent these 24 years studying its characteristics and ways to predict it. Research today divides into two main schools. Fundamental analysis holds that the share price reflects the company's intrinsic value and concentrates on which variables to analyze. Technical analysis uses historical opening, closing, highest, and lowest prices as the basis for predicting future prices and concentrates on data-processing and modeling methods. The two schools work within different frameworks, and each has its merits. Still, it is widely accepted that China's market is not weak-form efficient: price series depend on their own history, so technical analysis has a foothold. This thesis belongs to technical analysis.
Current technical-analysis methods include time series analysis, fuzzy mathematics, chaos theory, and data mining. Data mining appeared in recent years, as data volumes grew geometrically, as a way to process large amounts of data. It does not fix in advance the form of the information to be found, but lets the data speak for themselves. Popular data mining techniques include decision trees, neural networks, support vector machines, and cluster analysis, and each has several implementing algorithms. Data mining is a good fit for large and messy stock data. Scholars applying it to the stock market have mainly sought improvements in four areas: the design and refinement of the mining algorithms, the selection and processing of market variables, the way stock data are used, and the combination of different mining models. This thesis also starts from data mining, but explores and improves it from a new angle by proposing the concept of a data mining partial model.
The partial model concept originated in the particular structure of classification trees. The output of a classification tree is a tree with many leaves, and each leaf expresses one rule, so the tree holds as many rules as it has leaves. In practice the rules are not equally useful: some leaves are reliable and predict with high accuracy, while others have almost no value and predict with very low accuracy. If each leaf is regarded as a sub-model, accuracy can be computed for each sub-model instead of for the whole model. Keeping the sub-models with high accuracy and discarding those with low accuracy is what building a partial model means. In the stock market, the buy and sell points worth acting on are limited. Successful investors are not the ones who trade frequently every day, but the ones who wait and act only when the signal is clearest and most certain.
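The leaf-level filtering described in this paragraph can be sketched in a few lines of Python (the thesis itself used R). This is a minimal illustration, assuming each prediction has already been tagged with the leaf that produced it; the function names, the record layout, and the toy data are hypothetical and not taken from the thesis.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One record per prediction: (leaf_id, predicted_label, actual_label).
LeafRecord = Tuple[int, str, str]

def leaf_accuracies(records: List[LeafRecord]) -> Dict[int, float]:
    """Compute prediction accuracy separately for every leaf (sub-model)."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for leaf_id, predicted, actual in records:
        totals[leaf_id] += 1
        if predicted == actual:
            hits[leaf_id] += 1
    return {leaf: hits[leaf] / totals[leaf] for leaf in totals}

def select_leaves(records: List[LeafRecord], min_accuracy: float) -> Dict[int, float]:
    """Keep only the leaves whose accuracy reaches the cutoff; the rest are discarded."""
    return {
        leaf: acc
        for leaf, acc in leaf_accuracies(records).items()
        if acc >= min_accuracy
    }

# Hypothetical toy data: leaf 1 is right 3 times out of 3, leaf 2 once out of 3.
toy = [
    (1, "up", "up"), (1, "up", "up"), (1, "up", "up"),
    (2, "down", "up"), (2, "down", "down"), (2, "down", "up"),
]
partial_model = select_leaves(toy, min_accuracy=0.8)
print(partial_model)  # {1: 1.0}
```

The cutoff is left as a parameter because the right value depends on how many signals one is willing to give up in exchange for reliability.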
The thesis builds a decision tree partial model on basic data from the Shanghai Composite Index. Candlestick (K-line) theory is relatively mature, so to make the model output comparable with existing theory, the four basic daily indicators (opening, closing, highest, and lowest price) are converted into upper shadow length, lower shadow length, body length, and body color. These four indicators are the input variables, and whether the price has risen or fallen 10 days later is the output variable. After a decision tree is built in R (version 3.0.2), its leaves are filtered. Taken together, the 7 leaves with the highest fitting accuracy show that: if a harami pattern and a double-needle bottom appear together, the price rises; if only a double-needle bottom appears, the price also rises when the bottoming needle is long (>= 9.65), while the outlook is unknown when the needle is not pronounced; and if only a harami pattern appears, the outlook is unknown on the basic data alone. The harami pattern and the double-needle bottom are empirical summaries that practitioners have already drawn from candlestick shapes, and this first exploration of the classification tree partial model broadly agrees with them.
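The conversion from daily prices to candlestick indicators can be illustrated as follows. The abstract does not give the exact formulas, so this sketch assumes the standard candlestick definitions, the Chinese market convention that a rising candle is red, and a target defined by comparing the closing price 10 trading days later with the current close. All function and field names are illustrative.

```python
from typing import List, NamedTuple

class CandleFeatures(NamedTuple):
    upper_shadow: float  # high minus the top of the body
    lower_shadow: float  # bottom of the body minus low
    body_length: float   # absolute distance between open and close
    body_color: str      # "red" = close above open, "green" = below, "flat" = equal

def candle_features(open_: float, high: float, low: float, close: float) -> CandleFeatures:
    """Convert one day's open/high/low/close into the four candlestick indicators."""
    body_top = max(open_, close)
    body_bottom = min(open_, close)
    if close > open_:
        color = "red"    # assumed convention: rising candles are red in Chinese markets
    elif close < open_:
        color = "green"
    else:
        color = "flat"
    return CandleFeatures(high - body_top, body_bottom - low, body_top - body_bottom, color)

def rise_after(closes: List[float], horizon: int = 10) -> List[bool]:
    """Label day t True when the close `horizon` days later is higher (assumed definition)."""
    return [closes[t + horizon] > closes[t] for t in range(len(closes) - horizon)]

features = candle_features(open_=10.0, high=12.0, low=9.0, close=11.0)
print(features)
```

The last `horizon` days of a series receive no label, because their outcome is not yet observable.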
The decision tree partial model is partial from the point of view of the model output: in essence it accepts only some of the model's results, not all of them. Building on it, the thesis extends the partial model concept. The buy and sell points available in the market are limited, so prediction is needed only when the price signal is clear, whether it points up or down. On this basis, the support vector machine (SVM) partial model aims to find the data environment in which an SVM predicts best. This is a partial model from the point of view of the model input. Specifically, an SVM that is built on the training data without any selection does not predict well. The SVM partial model builds model M1 on training set A, picks out the records that M1 fits correctly as set B, and then builds model M2 on set B. A classification tree is then used to find and summarize what the records in set B have in common, denoted K, and model M2 is used to predict only the validation records that have characteristic K. In other words, only records with characteristic K are eligible as input to model M2.
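The two-stage procedure (M1 on A, correctly fitted subset B, M2 on B, prediction restricted to records with characteristic K) can be sketched generically. To stay dependency-free, the sketch below takes the training routine as a parameter and demonstrates it with a simple stand-in classifier that is NOT an SVM; in the thesis the models are SVMs and K is derived with a classification tree, whereas here `has_k` is a hand-written, hypothetical predicate. All names and data are illustrative.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Record = Tuple[Sequence[float], int]       # (features, label); 1 = rise, 0 = fall
Model = Callable[[Sequence[float]], int]   # maps features to a predicted label
Fitter = Callable[[List[Record]], Model]   # training routine (an SVM in the thesis)

def build_partial_model(train_a: List[Record], fit: Fitter) -> Tuple[Model, List[Record]]:
    """Fit M1 on A, keep the records M1 fits correctly as B, then fit M2 on B."""
    m1 = fit(train_a)
    set_b = [(x, y) for x, y in train_a if m1(x) == y]
    m2 = fit(set_b)
    return m2, set_b

def predict_eligible(m2: Model, has_k: Callable[[Sequence[float]], bool],
                     validation: List[Record]) -> List[Tuple[Sequence[float], int]]:
    """Predict only the validation records that have characteristic K."""
    return [(x, m2(x)) for x, _ in validation if has_k(x)]

def fit_midpoint(records: List[Record]) -> Model:
    """Stand-in classifier (not an SVM): threshold halfway between the class means."""
    rise = mean(x[0] for x, y in records if y == 1)
    fall = mean(x[0] for x, y in records if y == 0)
    threshold = (rise + fall) / 2
    return lambda x: 1 if x[0] > threshold else 0

# Hypothetical one-feature toy data.
train = [([1.0], 0), ([2.0], 0), ([8.0], 0), ([6.0], 1), ([7.0], 1), ([9.0], 1)]
validation = [([0.5], 0), ([4.0], 1), ([10.0], 1)]
has_k = lambda x: x[0] <= 2.0 or x[0] >= 6.0   # hypothetical characteristic K

m2, set_b = build_partial_model(train, fit_midpoint)
predictions = predict_eligible(m2, has_k, validation)
print(len(set_b), predictions)
```

In this toy run M1 misfits one training record, so B has five records, and the middle validation record is skipped because it lacks characteristic K. Skipping a record is a deliberate outcome of the method: the model declines to predict outside its preferred data environment.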
Before building the SVM partial model, the thesis uses analysis of variance to show that SVM models built from different data inputs differ significantly in goodness of fit. The 735 records from January 20, 2011 to February 18, 2014 are split into groups of 50 records, which gives 14 groups, and three comparative experiments are run on these 14 groups. In the first experiment, every record in each group is used for modeling. In the second, only the first 30 records of each group are used. In the third, only the first 20 records of each group are used. Under the null hypothesis that the models built from the three input schemes do not differ significantly in fit, the p-value is approximately 0. The null hypothesis is therefore rejected: different data inputs from the same period can lead to very different goodness of fit.
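The grouping and the one-way analysis of variance can be sketched as follows. The abstract does not say what happens to the records left over after 14 full groups of 50 (735 = 14 x 50 + 35), so the sketch assumes they are dropped. It computes only the F statistic; obtaining a p-value needs the F distribution, for which a library routine such as `scipy.stats.f_oneway` could be used. The sample values are hypothetical, not results from the thesis.

```python
from statistics import mean
from typing import List, Sequence

def split_into_groups(records: Sequence, size: int = 50) -> List[Sequence]:
    """Consecutive full groups of `size`; a trailing partial group is dropped (assumption)."""
    n_groups = len(records) // size
    return [records[i * size:(i + 1) * size] for i in range(n_groups)]

def first_n(groups: List[Sequence], n: int) -> List[Sequence]:
    """Input scheme for experiments 2 and 3: keep only the first n records of each group."""
    return [group[:n] for group in groups]

def one_way_anova_f(samples: Sequence[Sequence[float]]) -> float:
    """F statistic of a one-way ANOVA: between-sample over within-sample mean square."""
    grand_mean = mean(v for sample in samples for v in sample)
    ss_between = 0.0
    ss_within = 0.0
    for sample in samples:
        sample_mean = mean(sample)
        ss_between += len(sample) * (sample_mean - grand_mean) ** 2
        ss_within += sum((v - sample_mean) ** 2 for v in sample)
    df_between = len(samples) - 1
    df_within = sum(len(sample) for sample in samples) - len(samples)
    return (ss_between / df_between) / (ss_within / df_within)

groups = split_into_groups(list(range(735)), size=50)
print(len(groups), len(first_n(groups, 30)[0]), len(first_n(groups, 20)[0]))

# Hypothetical fit scores for three input schemes (three samples of three values).
f_stat = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
print(f_stat)
```

In the thesis setting, each of the three samples would hold 14 goodness-of-fit values, one per group, so the test compares the three input schemes across the same time periods.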
After this initial check of the practicality of the decision tree partial model and the soundness of the SVM partial model, the thesis uses the two partial models to search for investment rules in the stock market. Chapter 5 applies the decision tree partial model with the input variables "yesterday's body length, yesterday's body color, yesterday's lower shadow length, today's body length, today's body color, today's lower shadow length, DIF, DEA, DIF-DEA" and the output variable "rise or fall of the price 10 days later". It finds 9 leaves with a fitting accuracy of 80% or above and applies the rules they express to the validation data. Rules 32, 11, 132, and 266 all reach a prediction accuracy of 100%. After these rules are sorted and combined, they turn out to say: if DIF-DEA < -1.85, the price is predicted to fall; if DIF-DEA > 11.05, the price is predicted to rise; if -1.85 < DIF-DEA < 11.05, the future trend is unclear. The historical literature on technical analysis contains the summary that when DIF > 0 and DEA > 0, the price rises if DIF > DEA and falls if DIF < DEA, and that when DIF < 0 and DEA < 0, the price likewise rises if DIF > DEA and falls if DIF < DEA. The decision tree partial model therefore gives a more precise numerical interval on top of this summary. The thesis argues that the stricter interval (bounded by -1.85 and 11.05 rather than by 0) may be due to investor psychology. When the market rebounds slightly, most investors still wait and do not act readily, which leaves the outlook unclear. Only when the rebound reaches a certain size do investors believe that spring has come and buy, and the price then rises. The reverse also holds.
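The consolidated rule is simple enough to write down directly. The sketch below reads the thresholds as strict inequalities (DIF-DEA below -1.85 means fall, above 11.05 means rise, anything in between is unclear), which is an interpretation of the text. It also shows one common way to compute DIF and DEA, assuming the conventional MACD settings of 12, 26, and 9 periods and an exponential moving average seeded with the first value; the abstract does not state which settings the thesis used.

```python
from typing import List, Tuple

def ema(values: List[float], span: int) -> List[float]:
    """Exponential moving average with factor 2 / (span + 1), seeded with the first value."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def dif_dea(closes: List[float], fast: int = 12, slow: int = 26,
            signal: int = 9) -> Tuple[List[float], List[float]]:
    """DIF = fast EMA minus slow EMA of the close; DEA = EMA of DIF (assumed MACD settings)."""
    dif = [f - s for f, s in zip(ema(closes, fast), ema(closes, slow))]
    dea = ema(dif, signal)
    return dif, dea

def classify(dif_minus_dea: float, lower: float = -1.85, upper: float = 11.05) -> str:
    """Apply the consolidated rule; values between the thresholds give no signal."""
    if dif_minus_dea < lower:
        return "fall"
    if dif_minus_dea > upper:
        return "rise"
    return "unclear"

print(classify(-2.0), classify(12.0), classify(0.0))  # fall rise unclear
```

Because the thresholds are absolute values fitted to the Shanghai Composite Index over a specific period, they should not be expected to transfer to series with a different price level.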
To build the SVM partial model, the training data are modeled first. The correctly fitted records are then collected and modeled again, and the patterns they share are identified and denoted G1, G2, G3, ... . Next, the validation records that match G1, G2, G3, ... are selected and predicted with the second model, and the prediction accuracy is calculated. Following this approach, 4 common patterns were found in the correctly fitted data, and they mostly involve some shared feature of three indicators: the previous day's lower shadow length, DIF, and DIF-DEA. When the validation records matching these 4 patterns are selected and predicted, the accuracies are 57.1%, 46.1%, 72.7%, and 75%. The mean of these is clearly higher than the 55.5% accuracy obtained when the training data are modeled directly, without any filtering, and checked against the validation data. This is further evidence that there are data environments suited to prediction with an SVM, and that predicting only when such an environment arrives works much better than predicting blindly without regard to timing.
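The comparison in this paragraph is a one-line calculation: the mean of the four reported accuracies is (57.1 + 46.1 + 72.7 + 75) / 4 = 62.725%, about 7 percentage points above the 55.5% baseline. The short check below also shows that three of the four patterns beat the baseline individually, while the 46.1% pattern falls below it. The variable names are illustrative.

```python
from statistics import mean

rule_accuracies = [57.1, 46.1, 72.7, 75.0]  # percentages reported for the 4 patterns
baseline = 55.5                              # reported accuracy of the unfiltered SVM

average = mean(rule_accuracies)
above_baseline = [a for a in rule_accuracies if a > baseline]

print(f"mean accuracy: {average:.1f}%")
print(f"patterns above baseline: {len(above_baseline)} of {len(rule_accuracies)}")
```

The abstract does not report how many validation records each pattern covers, so this simple mean weights the four patterns equally.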
Classical statistics always starts from a set of variables that fits economic theory, specifies the relationships among them in advance, and then carries out regression analyses inside that prepared framework. It is a "theory first, data second" way of thinking. Data mining breaks this routine. It is not bound by any prior theory of "what should be" and gives the say entirely to the data, which makes it a "data first, theory second" way of thinking. For this reason, the thesis ventures to discuss the partial model concept without a detailed mathematical derivation. The thesis not only proposes the concept but also extends it: when data mining is used to process data, if any one stage of building the model (data input, data processing, or result output) is not adopted as a whole, the model is called a data mining partial model. The classification tree partial model is partial from the point of view of the output, and the SVM partial model is partial in the data input stage. Partial models with further meanings and from further angles may appear in the future, and the author believes that more and more scholars will join the discussion of partial models.

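[Degree-granting institution]: Southwestern University of Finance and Economics (西南财经大学)
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: F832.51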

[References]