基于Python的维吾尔文文本聚类系统设计与实现 (Design and Implementation of a Python-based Uyghur Text Clustering System)
Published: 2019-04-15 19:52
【Abstract】: With the rapid development of the Internet, the amount of data on the Web keeps growing. How to acquire, manage, and use these data quickly and effectively has become an important research topic in data mining. As an effective tool for managing and organizing text, text clustering has received increasing attention and study: it can address these problems to a considerable extent, saving time and improving efficiency, and it has important applications in information retrieval, search engines, digital library management, and other fields. Starting from the characteristics of the Uyghur language, this thesis first builds a relatively large text corpus and constructs a preliminary stop-word list from it. To reduce the dimensionality of the feature space, a stemming method is applied; experiments show that it reduces the original feature dimensionality by 23%-25%. Secondly, the strengths and weaknesses of the K-means and GAAC (group-average agglomerative clustering) algorithms are studied in depth, and an improved K-means algorithm is developed to address both the instability of classical K-means caused by its heavy dependence on the initial cluster centers and the high time complexity of GAAC. The experimental results show that the proposed improved K-means algorithm is feasible and effective. Finally, these algorithms are used to implement a Python-based Uyghur text clustering system consisting of three main modules: preprocessing, text representation, and clustering. Comparative experiments on the developed system verify the accuracy, stability, and low time complexity of the improved K-means algorithm, and the clustering results show that the system runs stably.
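To make the pipeline described in the abstract concrete, the following is a minimal Python sketch of the three modules (preprocessing, TF-IDF text representation, clustering). The thesis's actual stop-word list, Uyghur suffix rules, and specific K-means improvement are not given on this page, so the example stop words, the naive suffix stripper, and the k-means++ seeding shown here are illustrative assumptions only; GAAC is approximated with scikit-learn's average-linkage agglomerative clustering.

```python
# Minimal sketch of the pipeline from the abstract, assuming scikit-learn is
# available.  The stop words, suffixes, and the "improved" K-means step are
# illustrative stand-ins, not the thesis's actual resources or algorithm.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering

STOP_WORDS = {"ۋە", "بىلەن", "ئۈچۈن"}        # placeholder stop-word set
SUFFIXES = ("لار", "لەر", "نىڭ", "دىن")      # placeholder suffixes to strip

def stem(token):
    """Naive suffix stripping as a stand-in for the thesis's stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(document):
    """Preprocessing: whitespace tokenization, stop-word removal, stemming."""
    return [stem(t) for t in document.split() if t not in STOP_WORDS]

def cluster(documents, num_clusters=5):
    # Text representation: TF-IDF weights over the stemmed vocabulary.
    vectorizer = TfidfVectorizer(analyzer=analyze)
    X = vectorizer.fit_transform(documents)

    # Stand-in for the improved K-means: k-means++ seeding reduces the
    # sensitivity to initial centers that the abstract attributes to the
    # classical algorithm.
    km = KMeans(n_clusters=num_clusters, init="k-means++", n_init=10,
                random_state=0).fit(X)

    # GAAC analogue: group-average (average-linkage) agglomerative clustering,
    # which needs a dense feature matrix.
    gaac = AgglomerativeClustering(n_clusters=num_clusters,
                                   linkage="average").fit(X.toarray())
    return km.labels_, gaac.labels_
```

For example, calling `cluster(corpus, num_clusters=5)` on a list of Uyghur document strings would return the two label assignments, which could then be compared for accuracy, stability, and running time as the abstract describes.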
【Degree-granting institution】: Xinjiang University (新疆大学)
【Degree level】: Master's
【Year awarded】: 2012
【CLC number】: TP391.1
Article ID: 2458444
Link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2458444.html