Research on Decision Tree Generation Algorithms Based on Time Series

Published: 2018-07-05 01:38

Topics: time series classification + positive-unlabeled learning; Source: Northwest A&F University, 2017 master's thesis

[Abstract]: Time series data are high-dimensional data closely tied to everyday life. They are characterized by long time spans, ordered real values, and autocorrelation among data points, and they are widespread in business, medicine, meteorology, and other fields. Among time series classification algorithms, those based on decision trees offer strong decision-analysis capability, do not rely on the assumption of a normal statistical distribution, and achieve high classification accuracy and robustness. Previous decision-tree-based time series classifiers treat each time point as a separate attribute when splitting a node and align series one-to-one along the time axis. This ignores how autocorrelation within a series and misalignment between series affect attribute selection: within a series, the value at a given time point is correlated with the values at several neighboring time points, and across series, the data at the same time point can carry different meanings. To address this shortcoming of traditional decision tree algorithms, this thesis studies new sequence-pair-based time series decision tree classifiers under both supervised learning and positive-unlabeled (PU) learning. The main results are as follows.

(1) A decision tree generation algorithm based on time series pairs under supervised learning. Starting from the autocorrelation and misalignment that distinguish time series from other sequences, the algorithm introduces the concept of sequence entropy to replace the information entropy used in traditional decision trees as the attribute selection criterion. For partitioning by attribute value, it uses a sequence pair as the splitting attribute of the decision tree and partitions the set of time series by dynamic time warping (DTW) distance, yielding a decision-tree-based time series classification algorithm (TSDT). Building on this, a dynamic classifier ensemble technique is applied: for each test sample, a validation set is constructed from the training set with the nearest neighbor algorithm, and several best classifiers are selected dynamically according to how randomly constructed TSDT classifiers perform on that validation set. The result is a dynamic ensemble time series classification model (En-TSDT). Experiments on the public UCR time series datasets show that, compared with the strongest current classifier, the DTW-based nearest neighbor classifier, En-TSDT improves the average F1 score by 1.47% and reduces the error rate by 9.80%. The results show that a decision tree algorithm based on sequence entropy and sequence-pair information gain effectively overcomes the failure of traditional decision tree algorithms to account for autocorrelation and misalignment in time series data, and improves the classification performance of decision trees on such data.

(2) A decision tree generation algorithm based on time series pairs under positive-unlabeled learning. Building on the positive-unlabeled decision tree algorithm (POSC4.5), the algorithm extends the splitting attribute so that a whole series serves as the feature attribute. Sequence pairs are formed by randomly combining positive examples in the node with negative examples mined from the unlabeled set. The pair with the largest split information gain becomes the node's splitting attribute, and the node is split according to the DTW distances between its samples and the sequence pair, producing a time series decision tree for the positive-unlabeled setting (TSPOSC4.5). The negative set is obtained by computing the distance between each series in the unlabeled set and the positive set, taking the series farthest from the positive set as a negative example, and then using the nearest neighbor method to take that example's nearest neighbors in the unlabeled set as the negative set. On top of this algorithm, the parameter estimate is computed several times and averaged, which reduces the effect of parameter estimation error on classification performance. Ensemble learning is then used to build a positive-unlabeled time series ensemble decision tree model (En-TSPOSC4.5). On the 16 UCR datasets that fit the positive-unlabeled setting, En-TSPOSC4.5 was compared with the current best Markov-property-based positive-unlabeled time series classification model, PU Markov, and with the widely used DTW-based positive nearest neighbor algorithm. Averaged over different positive labeling ratios, its F1 score is higher by 4.95% and 11.45%, respectively. The results show that the sequence-pair-based positive-unlabeled time series ensemble decision tree algorithm achieves stronger classification performance.
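The node split described in the abstract (choose a sequence pair, then partition the node's series by DTW distance to that pair) can be sketched roughly as follows. This is only an illustrative sketch under several assumptions, not the thesis's implementation: the abstract does not define sequence entropy, so ordinary Shannon entropy stands in for it; candidate pairs are assumed to come from different classes; each series is assumed to go to the branch of the closer pair member; and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)          # row 0 of the DTW table
    for i in range(1, len(a) + 1):
        curr = [inf] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def entropy(labels):
    """Shannon entropy of a label list (stand-in for the thesis's sequence entropy)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def split_by_pair(series, ref_a, ref_b):
    """Assumed rule: send each series to the branch of the closer reference series."""
    left, right = [], []
    for idx, s in enumerate(series):
        if dtw_distance(s, ref_a) <= dtw_distance(s, ref_b):
            left.append(idx)
        else:
            right.append(idx)
    return left, right


def information_gain(labels, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    total = len(labels)
    children = sum(len(part) / total * entropy([labels[i] for i in part])
                   for part in (left, right))
    return entropy(labels) - children


def best_pair_split(series, labels):
    """Try every cross-class pair and keep the one with the largest gain."""
    best = None
    for i, j in combinations(range(len(series)), 2):
        if labels[i] == labels[j]:
            continue  # assumption: a pair consists of series from different classes
        left, right = split_by_pair(series, series[i], series[j])
        gain = information_gain(labels, left, right)
        if best is None or gain > best[0]:
            best = (gain, i, j, left, right)
    return best


# Toy data: two "bump" shapes and two "dip" shapes, each shifted in time.
series = [
    [0, 0, 1, 2, 1, 0],
    [0, 1, 2, 1, 0, 0],
    [2, 2, 1, 0, 1, 2],
    [2, 1, 0, 1, 2, 2],
]
labels = ["bump", "bump", "dip", "dip"]
gain, i, j, left, right = best_pair_split(series, labels)
print(gain, (i, j), left, right)
```

Because DTW lets time points stretch or compress, the two shifted "bump" series end up at distance zero from each other, which a point-by-point attribute comparison would not capture. A full tree would apply `best_pair_split` recursively to each branch until the nodes are pure.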
[Degree-granting institution]: Northwest A&F University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13

[Similar literature]