Research on Large-Scale Hierarchical Classification Techniques for Internet Text

Published: 2018-09-19 12:36

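[Abstract]: With the development of information technology, Internet data and electronic data have grown rapidly. To organize and manage the massive volume of text on the Internet effectively, Internet text is usually classified according to a topic category hierarchy with a tree or directed acyclic graph structure, which organizes it into a directory containing thousands or even tens of thousands of categories. A comprehensive and accurate Internet directory enables fast, fine-grained network access control. In this setting, the large-scale hierarchical classification problem studies how to assign Internet text accurately to the categories of the hierarchy. Large-scale hierarchical classification of Internet text is the foundation for building Internet directories, an important technical means of building a healthy and harmonious Internet environment, and the basis of network applications such as information retrieval, safe ("green") Internet access, network reputation management, and security filtering. Unlike traditional text classification, large-scale hierarchical classification involves a very large taxonomy and lacks sufficient effective training corpora; its objects are mainly web text, and they are evolving toward social text. These characteristics set it apart from traditional text classification and pose greater technical challenges. Building on an analysis of related work, this thesis studies four characteristics of large-scale hierarchical classification: the very large taxonomy, the prevalence of rare categories, the lack of labeled samples for learning, and the evolution of the classification objects toward social text. The main research contents and results are as follows.

1) A survey of the large-scale hierarchical classification problem. The thesis defines the problem and analyzes the strategies for solving it, groups the existing solution methods into classes, introduces and compares the typical methods in each class, and finally summarizes the methods and points out where each is applicable.

2) For the very large category hierarchy, the thesis studies a two-stage classification method based on candidate category search. By searching the hierarchy for candidate categories related to the document to be classified, the large-scale classification problem is reduced to a smaller one; a classifier is then trained on the samples of the candidate categories and used to classify the document. The thesis first defines the concepts related to candidate search and proposes quantitative evaluation metrics for it. It then analyzes the computational complexity of the candidate search problem and proves that the problem is NP-hard by reducing the set cover problem to it. It further proposes a heuristic candidate search algorithm based on a greedy strategy, and proves that the greedy strategy is a locally optimal choice and that the algorithm runs in polynomial time. In the classification stage, the context of the candidate categories in the category tree is used, and ancestor categories help to distinguish between candidates. Finally, the candidate search method and the ancestor-assisted strategy are combined into a two-stage classification method that decides the document's category. Experiments on web page data from the ODP Simplified Chinese directory show that, compared with existing algorithms, the proposed candidate category search algorithm improves the accuracy of candidate category search by about 7.5%, and that the two-stage classification method built on it, which exploits the category hierarchy, achieves better classification results.

3) For rare categories, which have few instances, the thesis uses the LDA topic model to mine the topic features of documents and studies a hierarchical classification method based on LDA feature extraction. In a topic category hierarchy, a topic category usually contains a series of subtopic categories, and the topic features of a document reflect the category it belongs to well. The thesis therefore uses LDA to extract topic features, transforming documents from the word feature space into the topic feature space; this dimensionality reduction mitigates the high-dimensional sparsity of text data. In addition, sample data are grouped according to the category hierarchy to increase the training samples of rare categories. Because LDA topic extraction is time-consuming, a hierarchical classification model is adopted to reduce the time cost of training and prediction. Finally, in view of the characteristics of web page data, support vector machines, which suit small-sample, high-dimensional problems, are used to train binary classifiers, and a top-down classification framework is proposed for training and prediction. Experiments on the ODP Simplified Chinese directory show that, compared with a top-down classification method based on feature words, the proposed method effectively improves classification performance on rare categories in web topic directories.

4) For expert-compiled taxonomies that lack corpora, the thesis studies classification with unlabeled data. Traditional text classification methods need labeled corpora to train classifiers, but manual labeling is expensive. The thesis combines category knowledge with topic hierarchy information to construct web queries, searches multiple web data sources for related documents and extracts learning samples from them, thereby providing a basis for supervised learning, and learns the classifiers with hierarchical support vector machines. To deal with noise in web search results, three measures are taken to improve learning: (a) web queries are constructed from category knowledge and category hierarchy information, with query keywords generated from the label path of each node; (b) samples are produced from multiple data sources, searching both the Google search engine and Wikipedia for related pages and documents to obtain comprehensive sample data; (c) sample data are grouped according to the category hierarchy to obtain a more complete feature source for each category, and the classification model is learned using the topic category hierarchy. The result is a hierarchical text classification method based on unlabeled web data. In experiments on the ODP Simplified Chinese directory dataset, the proposed method comes close in classification accuracy to a supervised method trained on labeled samples, while avoiding manual labeling.

5) For social text, the thesis proposes a user topic model, UTM. According to the different ways in which microblog posts are generated, it divides user interests into original interests and reposting interests. The model is inferred by Gibbs sampling to discover each user's topic preferences for original posts and for reposts, from which the user's interest words are computed. The interest words found by UTM can be used for keyword tagging and tag recommendation for microblog users. UTM is validated on a Sina Weibo dataset; the experimental results show that its accuracy in tagging the interest words of microblog users is higher than that of existing methods. Because user interest words are too fine-grained to classify users effectively, the thesis then proposes a supervised generative model, u LTM, which represents user preferences as tags and topics and builds a topic model of user tags. u LTM treats the user tag category as an observed variable in the generative model. It uses the unsupervised learning mechanism of topic models to discover the latent topic patterns in microblogs, uses supervised learning to discover the topic feature distribution of each user tag, and then infers the topic categories of microblog users, finally achieving accurate classification of microblog users. u LTM is validated on a Twitter dataset for microblog user classification; the experimental results show that the model is suited to modeling and classifying category labels that have a clear topical meaning.

In summary, addressing the four characteristics of large-scale hierarchical classification (the very large taxonomy, the prevalence of rare categories, the lack of labeled samples for learning, and the evolution of the classification objects toward social text), this thesis studies key techniques including candidate category search, rare category classification, learning from unlabeled data, and social text modeling. These have important theoretical significance and application value for the classification and topic mining of Internet text.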
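The greedy candidate search mentioned in 2) is not specified in the abstract, so the following is only a minimal sketch of the general idea, borrowing the textbook greedy heuristic for set cover (the problem the thesis reduces from). It assumes, purely for illustration, that each category is represented by a set of terms and that a good candidate is one that covers many of the document's not-yet-covered terms. The function name `greedy_candidates`, the term-set representation, and the toy data are hypothetical and do not come from the thesis.

```python
from typing import Dict, Iterable, List, Set


def greedy_candidates(
    doc_terms: Iterable[str],
    category_terms: Dict[str, Set[str]],
    k: int,
) -> List[str]:
    """Select up to k candidate categories with a set-cover style greedy rule.

    At every step the category that covers the largest number of still
    uncovered document terms is chosen (a locally optimal choice). The loop
    runs at most k times over all categories, so the cost is polynomial.
    """
    uncovered = set(doc_terms)
    chosen: List[str] = []
    while uncovered and len(chosen) < k:
        remaining = [c for c in category_terms if c not in chosen]
        if not remaining:
            break
        # Locally optimal choice: maximise the newly covered terms.
        best = max(remaining, key=lambda c: len(uncovered & category_terms[c]))
        if not uncovered & category_terms[best]:
            break  # no remaining category covers anything new
        chosen.append(best)
        uncovered -= category_terms[best]
    return chosen


# Hypothetical toy data (assumption, for illustration only).
CATEGORY_TERMS = {
    "Sports/Football": {"goal", "league", "match"},
    "Sports/Tennis": {"match", "serve"},
    "Science/Physics": {"quantum", "energy"},
}

if __name__ == "__main__":
    doc = ["match", "goal", "quantum", "league"]
    print(greedy_candidates(doc, CATEGORY_TERMS, k=2))
```

In a two-stage setup of the kind the abstract describes, the returned candidates would then be passed to a second-stage classifier trained only on those categories; that stage is not shown here.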
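The top-down classification framework referred to in 3) and 4) can be pictured roughly as follows. This is a sketch under stated assumptions, not the thesis's implementation: the abstract says binary support vector machine classifiers are trained per category, whereas here a simple keyword-overlap score stands in for each classifier's decision value so that the example stays self-contained. The hierarchy `CHILDREN`, the keyword sets `KEYWORDS`, and the functions `keyword_score` and `classify_top_down` are all hypothetical.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy hierarchy: parent label -> child labels (assumption).
CHILDREN: Dict[str, List[str]] = {
    "Top": ["Sports", "Science"],
    "Sports": ["Football", "Tennis"],
    "Science": ["Physics"],
}

# Hypothetical keyword sets standing in for trained binary classifiers.
KEYWORDS: Dict[str, Set[str]] = {
    "Sports": {"match", "league", "serve", "goal"},
    "Science": {"quantum", "energy"},
    "Football": {"goal", "league"},
    "Tennis": {"serve", "racket"},
    "Physics": {"quantum", "energy"},
}


def keyword_score(category: str, doc_terms: Set[str]) -> float:
    """Stand-in decision value: positive iff at least one keyword matches."""
    return len(KEYWORDS[category] & doc_terms) - 0.5


def classify_top_down(
    doc_terms: Set[str],
    root: str = "Top",
    score: Callable[[str, Set[str]], float] = keyword_score,
) -> List[str]:
    """Walk down the hierarchy, following the best-scoring child at each level.

    The descent stops at a leaf, or earlier when no child of the current
    node receives a positive score. Returns the path from the root.
    """
    path = [root]
    node = root
    while CHILDREN.get(node):
        best = max(CHILDREN[node], key=lambda c: score(c, doc_terms))
        if score(best, doc_terms) <= 0:
            break  # no child accepts the document; stop at this level
        path.append(best)
        node = best
    return path


if __name__ == "__main__":
    print(classify_top_down({"goal", "league", "match"}))
```

Only the classifiers of the children along one path are evaluated, which is why a top-down scheme keeps prediction cost low on large hierarchies. The trade-off is that a wrong decision near the root cannot be corrected further down.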
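[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree granted]: 2014
[Classification number]: TP391.1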
