A Multi-Agent-Based Distributed Text Clustering Model

Published: 2019-01-06 16:38

[Abstract]: As the volume of big data on the Internet grows daily, there is an urgent need for new clustering methods that can handle large-scale semi-structured and unstructured text data. Existing work has three shortcomings: the text collections it is applied to are relatively homogeneous, its accuracy on semi-structured and unstructured Web text is low, and its timeliness cannot be guaranteed when the document collection is large. To address these shortcomings, a new swarm-intelligence-based text clustering model, Switch (a Swarm intelligence based text clustering algorithm), is proposed; it supports text clustering in multiple languages, including Tibetan, Chinese, and English. The basic idea is as follows. A vector space model of the text is constructed, and natural language processing and data preprocessing techniques are used to obtain a text collection composed of feature vectors. The parameters of the swarm intelligence text clustering algorithm are then initialized. Different agents can move freely in a two-dimensional text space; each computes the similarity between the text in its current grid region and other samples, and a probability conversion function gives the probability that the agent picks up or drops a sample, which yields the text clustering. A multi-agent architecture for distributed clustering of dynamic text streams is proposed and applied to the swarm intelligence text clustering algorithm. The distributed working environment is designed as a set of software agents that communicate with one another, and three types of agents are designed: similarity calculation, agent state awareness, and text parsing. By addressing agent state synchronization, processor load balancing, and the cost of inter-processor communication, the computation is divided into subtasks that are executed in a distributed manner on multiple processors. In addition, the working principle of the multi-agent-based distributed swarm intelligence text clustering method is described, and a distributed communication architecture is given in which the various agents communicate and cooperate to complete the clustering. Distributed text clustering on a cluster is implemented on a multi-agent basis through the JADE (Java Agent Development Framework) middleware. The advantages are that distributed computing and large-memory processing offer greater processing capability than a single machine, and that JADE enables agents to communicate and cooperate with one another, achieving efficient text clustering. Experiments are carried out on a large collection of real semi-structured Web text data in Tibetan, Chinese, and English. Taking Tibetan as an example, the results show that, compared with k-means and with the swarm intelligence clustering algorithm on a single node, the proposed text clustering algorithm under the distributed architecture is on average 12.2% and 3.8% more accurate, respectively, and reduces the time cost by 73.0% and 50.6% on average, respectively. On a cluster of n nodes, when the number of agents is between 150 and 250, the time cost of text clustering is approximately 1/n of that on a single node.
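The pick-up and drop step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which similarity measure or probability conversion function Switch uses, so this sketch assumes cosine similarity over feature vectors and swaps in the classic ant-clustering (Deneubourg / Lumer-Faieta style) probability forms. The function names and the constants `k1` and `k2` are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def local_similarity(doc, neighbours):
    """Average similarity of `doc` to the documents in the agent's grid region.

    Assumption: a plain average; the paper may weight or scale this differently.
    """
    if not neighbours:
        return 0.0
    return sum(cosine_similarity(doc, n) for n in neighbours) / len(neighbours)


def pick_probability(f, k1=0.1):
    """Probability that an unloaded agent picks up a document.

    Assumed classic form (k1 / (k1 + f))^2: high when the document
    does not resemble its neighbourhood (f close to 0).
    """
    return (k1 / (k1 + f)) ** 2


def drop_probability(f, k2=0.15):
    """Probability that a loaded agent drops its document here.

    Assumed classic form (f / (k2 + f))^2: high when the carried
    document resembles the neighbourhood (f close to 1).
    """
    return (f / (k2 + f)) ** 2


# Toy feature vectors (hypothetical): one matching and one unrelated neighbourhood.
doc = [1.0, 0.0, 1.0]
f_similar = local_similarity(doc, [[1.0, 0.0, 1.0], [2.0, 0.0, 2.0]])
f_dissimilar = local_similarity(doc, [[0.0, 1.0, 0.0]])
```

Under these assumptions, a document surrounded by unrelated text is almost certain to be picked up and almost never dropped, while a document carried into a matching neighbourhood is likely to be dropped there. Repeating this over many agents and steps is what gradually groups similar documents on the grid.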
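[Affiliations]: School of Cybersecurity, Chengdu University of Information Technology; School of Management, Chengdu University of Information Technology; School of Data Science and Engineering, East China Normal University; College of Computer Science and Technology, Zhejiang University; School of Information Science and Technology, Southwest Jiaotong University; College of Computer Science, Sichuan University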
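[Funding]: National Natural Science Foundation of China (61772091, 61165013, 61363037); Humanities and Social Sciences Research Planning Fund of the Ministry of Education (15YJAZH058); Sichuan Provincial University Scientific Research Innovation Team Program (18TD0027); Scientific Research Fund for Young and Middle-Aged Academic Leaders of Chengdu University of Information Technology (J201701); Sichuan Science and Technology Program (2018JY0448); Guangxi Natural Science Foundation (2017JJD170122y)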
[CLC Number]: TP391.1
