Research on Uyghur Text Classification and System Development

Published: 2018-05-31 14:26

Topic: Uyghur + text classification; Source: Xinjiang University, 2012 master's thesis

[Abstract]: With the rapid development of computer and network technology, the Internet has come into widespread use. The rapid growth of Web information poses a serious challenge for information retrieval: the sheer volume of information makes it harder to find what we need. Text classification plays a key and effective role in handling unorganized information, with important applications in information retrieval, search engines, digital library management, and other fields. Starting from the characteristics and writing rules of Uyghur, this thesis builds a relatively large text corpus (20 categories, 300 texts per category). After in-depth study of the characteristics and grammar rules of Uyghur, a fairly complete stop-word list was built through extensive experiments and manual review. The thesis analyzes the effect of stemming on the accuracy and speed of Uyghur text classification. Because reducing the dimensionality of the vector space is an important problem in text classification, the thesis applies a stemming method based on the morphological rules of Uyghur, which achieves good dimensionality reduction without affecting classification accuracy: after stemming, the dimensionality is reduced by about 25%. For feature selection, the CHI (chi-square) statistic is used, and experiments analyze how the number of selected features affects the results. They show that selecting 3%-5% of the original features is, relatively speaking, the best choice. Extensive experiments also analyze the effect of Uyghur spelling errors on text classification. The results show that spelling errors have little effect on classification, but they do have some effect on reducing the dimensionality of the vector space. The thesis studies in some depth the classification algorithms widely used in China and abroad, including KNN, naive Bayes (NB), and SVM, applies them to Uyghur texts, and analyzes the performance of each algorithm on Uyghur text. Finally, combining the characteristics of Uyghur with text classification techniques, it builds an experimental platform for Uyghur text classification (the Uyghur text classification system).
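The CHI feature selection step mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the exact formula or implementation the thesis used, so the code below assumes the standard 2x2 chi-square statistic over document counts and, as a further assumption, scores each term by its maximum over categories. The function names, the toy corpus, and its Latin-script tokens are hypothetical and serve only as an illustration.

```python
from collections import Counter, defaultdict


def chi_square_scores(docs):
    """Return {term: max chi-square over categories}.

    `docs` is a list of (tokens, label) pairs; a term counts once per document.
    """
    n = len(docs)
    docs_per_label = Counter(label for _, label in docs)
    docs_with_term = Counter()
    joint = defaultdict(Counter)  # joint[term][label] = docs in label containing term
    for tokens, label in docs:
        for term in set(tokens):
            docs_with_term[term] += 1
            joint[term][label] += 1

    scores = {}
    for term, per_label in joint.items():
        best = 0.0
        for label, label_total in docs_per_label.items():
            a = per_label[label]          # in category, contains term
            b = docs_with_term[term] - a  # other categories, contains term
            c = label_total - a           # in category, lacks term
            d = n - a - b - c             # other categories, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - c * b) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_fraction(scores, fraction):
    """Keep the highest-scoring `fraction` of terms (at least one term)."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))  # ties broken by name
    return ranked[:k]


# Hypothetical toy corpus: two categories, two documents each.
docs = [
    (["futbol", "oyun", "komanda"], "sports"),
    (["futbol", "musabiqe", "oyun"], "sports"),
    (["bazar", "baha", "oyun"], "economy"),
    (["bazar", "shirket", "baha"], "economy"),
]
scores = chi_square_scores(docs)
selected = select_top_fraction(scores, 0.4)
```

The fraction 0.4 is used here only because the toy vocabulary has seven terms; on a real vocabulary, the 3%-5% range reported in the abstract would correspond to `fraction` values between 0.03 and 0.05.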
[Degree-granting institution]: Xinjiang University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.1
