Design and Implementation of a Python-Based Uyghur Text Clustering System

Published: 2019-04-15 19:52

[Abstract]: With the rapid growth of the Internet, the volume of data available online keeps increasing. How to acquire, manage, and use this data quickly and effectively has become an important research topic in data mining. Text clustering, as an effective tool for managing and organizing text, has received growing attention. It can address these problems to a considerable extent, saving time and improving efficiency, and it has important applications in information retrieval, search engines, digital library management, and other fields. This thesis first builds a fairly large text corpus based on the characteristics of the Uyghur language. A preliminary stop-word list is constructed from the accumulated corpus, and stemming is applied to reduce the dimensionality of the feature space. Experimental results show that the stemming method reduces the original feature dimensionality by 23% to 25%. Second, the strengths and weaknesses of the K-means and GAAC (group-average agglomerative clustering) algorithms are studied in depth. An improved K-means algorithm is developed to address two shortcomings: the instability of classical K-means, caused by its heavy dependence on the initial cluster centers, and the high time complexity of GAAC. Experimental results show that the improved K-means algorithm is feasible and effective. Finally, these algorithms are used to implement a Python-based Uyghur text clustering system, which consists of three main modules: preprocessing, text representation, and clustering. Comparative experiments run on the developed system verify the accuracy, stability, and low time complexity of the improved K-means algorithm, and the clustering results indicate that the system runs stably.
[Degree-granting institution]: Xinjiang University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1
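
The three modules named in the abstract (preprocessing, text representation, clustering) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The abstract does not say how the improved K-means chooses its initial centers, so the sketch substitutes farthest-first seeding as one common way to reduce sensitivity to initialization. The stop-word set, the suffix list, and the Latin-script sample tokens are hypothetical toy data, and every function name is illustrative.

```python
import math
from collections import Counter

# Hypothetical toy resources; a real system would derive them from the corpus.
STOPWORDS = {"we", "bu"}
SUFFIXES = ("lar", "ler", "ning", "ni")


def stem(token):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Preprocessing module: lowercase, split on whitespace, drop stop words, stem."""
    return [stem(t) for t in text.lower().split() if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Text representation module: L2-normalized TF-IDF vectors over a sorted vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = [tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_seeds(vectors, k):
    """Deterministic seeding (an assumption, not the thesis's method):
    start from the first vector, then repeatedly add the vector that is
    farthest from its nearest already-chosen seed."""
    seeds = [vectors[0]]
    while len(seeds) < k:
        seeds.append(max(vectors, key=lambda v: min(sq_dist(v, s) for s in seeds)))
    return [list(s) for s in seeds]


def kmeans(vectors, k, max_iter=100):
    """Clustering module: standard K-means iterations from farthest-first seeds."""
    centers = farthest_first_seeds(vectors, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centers[c])) for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy usage with made-up Latin-script tokens.
texts = ["kitablar we kitab", "kitabning kitab", "mektepler mektep", "mektepni mektep"]
docs = [preprocess(t) for t in texts]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
```

Because the seeding is deterministic, repeated runs on the same input give the same clusters, which is the kind of stability the abstract attributes to the improved algorithm. Stemming here also shrinks the vocabulary of the toy input from six distinct non-stop-word tokens to two stems, mirroring in miniature the dimensionality reduction the abstract reports.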
