Research on Web Text Mining Based on Multi-Instance Multi-Label Classification

Published: 2018-06-20 11:41

Topic: multi-instance learning + least squares twin support vector machine; Reference: Tianjin University of Technology, 2017 master's thesis

[Abstract]: With the rapid development of network technology, Internet information resources are growing quickly, which places further demands on the classification of massive data. Text classification, the most important research direction in text mining, is widely used in practice. How to represent text effectively and retrieve information efficiently has become an urgent research topic in the field. Multi-instance multi-label text is abundant in real life and poses new challenges for text classification research. Traditional text classification is essentially single-instance single-label classification and cannot accurately handle text with multiple semantics and multiple categories. This thesis proposes using multi-instance multi-label learning to classify multi-label text accurately and effectively. The main work is as follows:

(1) A multi-instance multi-label learning framework is used for Chinese text classification. Multi-instance learning and multi-label learning were proposed to address semantic ambiguity and multi-category learning, respectively. Multi-instance multi-label learning (MIML) has mainly been applied to areas such as image classification and web page retrieval, where it has achieved good results. This thesis applies MIML to Chinese text classification and, taking into account the particular structure of Chinese and the multi-category nature of the text, improves the MIML framework to make it better suited to the task, offering a new approach to Chinese text classification.

(2) Text representation is a key step in text classification and strongly affects the learning performance of the subsequent classifier. Because Chinese text is semantically rich, this thesis represents text as multi-instance sentence bags. Mainstream text representation methods include the vector space model (VSM), which segments text at the word level and assumes that feature terms are independent, so the semantic information between words is lost. To address this loss, the thesis introduces a multi-instance text representation: each text is processed as a multi-instance bag with the sentence as the smallest unit of representation, so that the semantic information between words is preserved. In the data representation stage, text is represented as multi-instance sentence bags, which avoids the semantic loss caused by the independence assumption. The bags are then further optimized into topic bags, which shortens text processing time.

(3) In the classification stage, an improved LSTSVM multi-label classifier is used. For text represented as multi-instance topic bags, a degeneration strategy based on clustering converts the multi-instance multi-label data into a single-instance multi-label learning problem, and an improved least squares twin support vector machine (LSTSVM) multi-label classifier then classifies the text. The least squares twin support vector machine replaces one large quadratic programming problem (QPP) with two smaller ones, which increases computation speed and reduces computational complexity.

(4) A multi-instance multi-label text classification system is designed and built on the improved algorithm. The algorithm is validated and analyzed experimentally on the Reuters-21578 news corpus, the Emotion dataset, and a Chinese corpus dataset from Tongji University. The experimental results show that the improved algorithm outperforms existing multi-label classification algorithms on the evaluation metrics.
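The sentence-bag representation and the degeneration step can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the punctuation-based sentence splitter, and the use of the maximal Hausdorff distance to a set of reference bags (as in MIMLSVM-style degeneration, where the reference bags would be cluster medoids) are all assumptions made for illustration. The abstract does not say which clustering method or distance the author used, or how topic bags are built.

```python
import re
import numpy as np

def sentence_bag(text):
    """Split a document into sentences; each sentence is one instance of the bag.

    Assumption: sentences end at Chinese or ASCII terminal punctuation.
    """
    parts = re.split(r"[。！？!?.]+", text)
    return [s.strip() for s in parts if s.strip()]

def max_hausdorff(bag_a, bag_b):
    """Maximal Hausdorff distance between two bags of instance vectors (one row per instance)."""
    # Pairwise Euclidean distances, shape (len(bag_a), len(bag_b)).
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def degenerate(bags, reference_bags):
    """Map each bag to one vector: its distances to the reference bags.

    Assumption: reference_bags stand in for cluster medoids chosen elsewhere.
    The result is single-instance data that a multi-label classifier can consume.
    """
    return np.array([[max_hausdorff(b, r) for r in reference_bags] for b in bags])
```

Here each sentence would first have to be turned into a numeric vector (for example a term-weight vector) before `max_hausdorff` applies. That step is omitted because the abstract does not describe it.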
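For the classifier, the following is a minimal sketch of a standard linear LSTSVM for one binary problem, written with NumPy. It shows the baseline technique only; the thesis's improved multi-label variant is not specified in the abstract. In the standard formulation the two smaller problems have equality constraints, so each reduces to solving one linear system. The function names, the default penalties `c1` and `c2`, and the small `ridge` term added for numerical stability are assumptions. For multi-label use, one such model per label (binary relevance) would be one possible choice, again an assumption.

```python
import numpy as np

def fit_lstsvm(A, B, c1=1.0, c2=1.0, ridge=1e-8):
    """Fit a linear LSTSVM.

    A: samples of the positive class (one row each).
    B: samples of the negative class (one row each).
    Returns u = [w1, b1] and v = [w2, b2], the two non-parallel hyperplanes.
    """
    # Augment with a column of ones so the bias is part of the solution vector.
    E = np.hstack([A, np.ones((A.shape[0], 1))])
    F = np.hstack([B, np.ones((B.shape[0], 1))])
    EtE, FtF = E.T @ E, F.T @ F
    reg = ridge * np.eye(E.shape[1])  # assumption: tiny ridge for stability
    # Plane 1: close to class A, at functional distance about 1 from class B.
    u = -np.linalg.solve(FtF + EtE / c1 + reg, F.T @ np.ones(F.shape[0]))
    # Plane 2: close to class B, at functional distance about 1 from class A.
    v = np.linalg.solve(EtE + FtF / c2 + reg, E.T @ np.ones(E.shape[0]))
    return u, v

def predict_lstsvm(X, u, v):
    """Assign each row of X to the class whose hyperplane is nearer (+1 or -1)."""
    Xe = np.hstack([X, np.ones((X.shape[0], 1))])
    d_pos = np.abs(Xe @ u) / np.linalg.norm(u[:-1])
    d_neg = np.abs(Xe @ v) / np.linalg.norm(v[:-1])
    return np.where(d_pos <= d_neg, 1, -1)
```

Calling `fit_lstsvm(A, B)` on two well-separated groups of points and then `predict_lstsvm` on new points returns `+1` for points near the first group and `-1` for points near the second.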
[Degree-granting institution]: Tianjin University of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1; TP393.09

[Similar literature]