TextGen: A Realistic Text Dataset Generation Method for Benchmarking New Storage Systems (in English)
Published: 2019-04-15 19:11
[Abstract]: New storage systems improve performance and save storage space through built-in data compression, so data content significantly affects storage-system benchmark results. Real datasets are hard to copy to the target test system because of their size, and most cannot be shared for privacy reasons, so benchmark programs must generate test datasets synthetically. To keep test results accurate, the generated data must reproduce the characteristics of real datasets that affect storage-system performance. The existing method SDGen analyzes the byte-level content distribution of a real dataset and generates data accordingly, which keeps test results accurate for storage systems with built-in byte-level compression. However, SDGen does not analyze word-level content distribution, so it cannot guarantee accurate results for storage systems with built-in word-level compression. This paper proposes TextGen, a text dataset generation method based on a Lognormal probability distribution model. TextGen builds a corpus from the word segmentation of a real dataset, analyzes the distribution of words in the corpus, estimates the parameters of a Lognormal model of the word distribution by maximum likelihood, and generates data content from the model by Monte Carlo sampling. The generation time depends only on the size of the generated dataset, giving linear time complexity O(n). Four datasets were collected to validate the method, and a typical word-level compression algorithm, ETDC (End-Tagged Dense Code), was used for testing. Experimental results show that TextGen generates text datasets faster than SDGen, and that under compression testing the generated datasets match the compression speed and compression ratio of the real datasets more closely.
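The pipeline the abstract describes (segment a real dataset into words, fit a Lognormal model by maximum likelihood, then generate content by Monte Carlo sampling) can be illustrated with a short sketch. This is an assumption-laden illustration, not the paper's implementation: the closed-form MLE on log word frequencies and the variate-to-rank mapping in `generate_words` are hypothetical choices made for this note.

```python
import math
import random
from collections import Counter

def fit_lognormal_mle(frequencies):
    # Closed-form MLE for Lognormal(mu, sigma): ln(x) is normally
    # distributed, so mu and sigma are the mean and standard deviation
    # of the log-frequencies.
    logs = [math.log(f) for f in frequencies]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / len(logs)
    return mu, math.sqrt(var)

def generate_words(vocab_by_rank, mu, sigma, n_words, seed=42):
    # Monte Carlo generation: each draw is O(1), so an n-word dataset
    # costs O(n), matching the linear complexity claimed in the abstract.
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        # Hypothetical mapping: draw a Lognormal variate and clamp it to
        # a vocabulary rank; the paper's exact mapping is not given here.
        r = min(int(rng.lognormvariate(mu, sigma)), len(vocab_by_rank) - 1)
        words.append(vocab_by_rank[r])
    return " ".join(words)

# Build a toy corpus from word segmentation, fit the model, generate data.
corpus = "the quick brown fox jumps over the lazy dog the fox".split()
counts = Counter(corpus)
vocab = [w for w, _ in counts.most_common()]   # rank 0 = most frequent word
mu, sigma = fit_lognormal_mle(list(counts.values()))
print(generate_words(vocab, mu, sigma, 20))
```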
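For context on the evaluation, ETDC (End-Tagged Dense Code) is a word-level semistatic compressor that assigns dense variable-length byte codes to words ranked by descending frequency, using the high bit of the final byte as an end tag. The sketch below follows the textbook formulation of that scheme under an assumed 0-based rank; it is not code from the paper.

```python
def etdc_encode(rank):
    # Rank i gets a dense base-128 codeword; the top bit of the final
    # byte is the end tag, making codewords self-delimiting: 128
    # one-byte codes, then 128**2 two-byte codes, and so on.
    x, block, length = rank, 128, 1
    while x >= block:
        x -= block
        block *= 128
        length += 1
    out = bytearray()
    out.append((x % 128) | 0x80)   # tagged final byte
    x //= 128
    for _ in range(length - 1):
        out.append(x % 128)        # untagged leading bytes
        x //= 128
    return bytes(reversed(out))

def etdc_decode(data, vocab_by_rank):
    # Accumulate 7-bit digits until a tagged byte closes the codeword,
    # then add back the offsets of all shorter codeword blocks.
    words, x, length = [], 0, 0
    for b in data:
        x = x * 128 + (b & 0x7F)
        length += 1
        if b & 0x80:
            offset, block = 0, 128
            for _ in range(length - 1):
                offset += block
                block *= 128
            words.append(vocab_by_rank[offset + x])
            x, length = 0, 0
    return words

vocab = ["the", "of", "and", "to"]            # ranked by descending frequency
enc = b"".join(etdc_encode(vocab.index(w)) for w in ["the", "to", "and"])
print(enc.hex())                              # -> 808382
print(etdc_decode(enc, vocab))                # -> ['the', 'to', 'and']
```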
[Affiliation]: School
[Funding]: Project supported by the National Natural Science Foundation of China (Nos. 61572394 and 61272098), the Shenzhen Fundamental Research Plan (Nos. JCYJ20120615101127404 and JSGG20140519141854753), and the National Key Technologies R&D Program of China (No. 2011BAH04B03)
[Classification]: TP333
Article ID: 2458407
Link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2458407.html