Research on Large-Scale Hierarchical Classification Techniques for Internet Text
[Abstract]: With the development of information technology, Internet data and electronic data are growing rapidly. To organize and manage the mass of text on the Internet effectively, Internet text is usually classified according to a topic category hierarchy with a tree or directed acyclic graph structure and organized into a classified catalog of thousands or even tens of thousands of categories. A comprehensive and accurate Internet catalog enables fast, fine-grained network access control. In this process, large-scale hierarchical classification studies how to assign Internet text accurately to the categories of the category hierarchy. The classified catalog is a foundation for building a healthy and harmonious Internet environment and underlies network applications such as information retrieval, green Internet access, network reputation management, and security filtering. The characteristics of large-scale hierarchical classification make it very different from traditional text classification and pose greater technical challenges. Based on an analysis of related work, this thesis focuses on the huge scale of the classification system, the prevalence of rare categories, and the scarcity of labeled samples for classification learning. The main research contents and results are as follows:
1) The large-scale hierarchical classification problem is surveyed. A definition of the problem is given, solution strategies are analyzed, and existing solution methods are categorized. On the basis of this categorization, typical methods are introduced and compared. Finally, the methods are summarized and the applicability of each kind of method is pointed out.
2) To address the huge scale of the category hierarchy, a two-stage classification method based on candidate category search is studied. The large-scale classification problem is reduced to a small-scale one by searching the category hierarchy for candidate categories related to the document to be classified; a classifier is then trained on the samples of the candidate categories and used to classify the document. The computational complexity of the candidate search problem is analyzed: by reducing the set cover problem to candidate search, candidate search is proved to be NP-hard. A heuristic candidate search algorithm based on a greedy strategy is then proposed, and it is proved that the greedy strategy used in the algorithm is a locally optimal choice and that the algorithm is [...]. In the classification stage, candidate categories are distinguished with the help of their ancestor categories, using the context of each candidate in the category tree. Finally, a two-stage classification method is implemented that combines the candidate search method with the ancestor-assisted strategy to determine the document category. Experiments on web pages from the ODP Simplified Chinese directory show that the proposed candidate category search algorithm improves candidate search accuracy by about 7.5% over existing algorithms. Building on this, the two-stage classification method, which also draws on the category hierarchy, achieves better classification results.
3) A hierarchical classification method based on LDA feature extraction is studied, using a topic model to mine the topic features of documents. In a topic category hierarchy, a topic category usually contains a series of sub-topic categories, and the topic features of a document reflect the category it belongs to well, so the LDA model is used to extract topic features from text. To mitigate the high dimensionality and sparsity of text data, documents are mapped from the word feature space to the topic feature space. In addition, sample data are grouped according to the category hierarchy to increase the number of training samples for rare categories. Finally, a top-down classification framework is proposed in which binary classifiers are trained and applied using the support vector machine (SVM) model, which suits small-sample, high-dimensional pattern problems. Compared with traditional text classification methods, the proposed method effectively improves classification performance on rare categories in Web topic directories.
4) To address the lack of a corpus for expert-compiled classification systems, classification from unlabeled data is studied. Category knowledge and topic hierarchy information are combined to construct web queries; relevant documents are retrieved from various web data sources and learning samples are extracted from them, which provides a basis for supervised learning; a classifier is then learned with a hierarchical support vector machine. To cope with noisy data in web search results, three measures are taken to improve classification learning: (a) constructing web queries from category knowledge and category hierarchy information, using the node label path to generate query keywords; (b) generating samples from multiple data sources, searching for relevant pages and documents in sources such as the Google search engine and Wikipedia to obtain comprehensive sample data; (c) [...]. Finally, a hierarchical text classification method based on unlabeled web data is implemented. Experimental results on the ODP Simplified Chinese directory dataset show that the proposed method obtains more complete feature sources for each category and comes close to supervised classification with labeled training samples while avoiding manual labeling.
5) For classification objects that are social text, a user topic model, UTM, is proposed. It divides user interest into original interest and reposting interest according to the way a microblog post is generated. The user's original topic preference and reposting topic preference are discovered separately, and the user's interest words are then computed. The interest words discovered by the UTM model can be used for keyword annotation and tag recommendation. The performance of the UTM model is validated on a Sina microblog dataset, and the experimental results show that the performance of the UTM model in microblog [...]. To overcome the overly fine granularity of user interest words, a supervised generative model, uLTM, is proposed. The model expresses user preferences as tags and topics and builds a topic model for user tags. uLTM treats user tags as categories and introduces them into the generative model as observed variables. Hidden topic patterns in microblogs are discovered through the unsupervised learning mechanism of the topic model, and the topic feature distributions of user tags are discovered through supervised learning. The topic categories of microblog users are then inferred, which finally yields an accurate classification of microblog users. The experimental results show that the model is suitable for modeling and classifying category labels with explicit topic meanings.
In summary, this thesis addresses four characteristics of large-scale hierarchical classification: the classification system is huge, rare categories are common, classification learning lacks labeled samples, and classification objects are evolving toward social text. It studies key technologies for large-scale hierarchical classification, including candidate category search, rare category classification, learning from unlabeled data, and social text modeling.
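The NP-hardness argument in item 2 works by reducing set cover to candidate search. As a rough illustration of what a greedy, locally optimal choice looks like for that underlying problem, the sketch below implements the textbook greedy heuristic for set cover in Python. It is not the thesis's candidate search algorithm: the function name `greedy_set_cover`, the idea of treating each category as the set of document terms it covers, and the toy data are all assumptions made for illustration.

```python
def greedy_set_cover(universe, subsets):
    """Textbook greedy heuristic for set cover (illustrative only).

    universe: iterable of elements to cover (here: terms of a document).
    subsets:  dict mapping a name (here: a category) to the set it covers.
    Returns (chosen names in pick order, elements left uncovered).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Locally optimal step: take the subset that covers the most
        # still-uncovered elements (ties go to the first one in dict order).
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered),
                   default=None)
        if best is None or not (subsets[best] & uncovered):
            break  # nothing left can cover the remaining elements
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen, uncovered


# Hypothetical toy data: terms of one document and the terms each category covers.
terms = {"football", "league", "goal", "stock", "market"}
categories = {
    "Sports/Football": {"football", "league", "goal"},
    "Sports": {"football", "goal"},
    "Business/Finance": {"stock", "market"},
    "News": {"league", "market"},
}
chosen, uncovered = greedy_set_cover(terms, categories)
```

In this toy run the heuristic first picks the category covering three terms and then the one covering the remaining two. Each step is only locally optimal, so the result is not guaranteed to be a minimum cover in general.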
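Item 4 mentions generating query keywords from a node's label path. A minimal Python sketch of that idea follows, assuming the category hierarchy is a tree stored as a child-to-parent mapping and that a query is simply the labels from the root down to the node, joined by spaces. The helper names `label_path` and `build_query`, the mapping format, the skipped root label "Top", and the sample categories are hypothetical and not taken from the thesis.

```python
def label_path(node, parent):
    """Return the labels from the root down to `node`.

    parent: dict mapping each category label to its parent label (None for the root).
    """
    path = []
    seen = set()
    while node is not None:
        if node in seen:
            raise ValueError("cycle in category hierarchy")
        seen.add(node)
        path.append(node)
        node = parent.get(node)
    return path[::-1]  # collected leaf-to-root, so reverse


def build_query(node, parent, skip=("Top",)):
    """Join the label path into a keyword query, dropping uninformative labels."""
    return " ".join(label for label in label_path(node, parent) if label not in skip)


# Hypothetical fragment of a directory-style hierarchy.
parent = {
    "Top": None,
    "Computers": "Top",
    "Programming": "Computers",
    "Python": "Programming",
}
query = build_query("Python", parent)
```

Using the whole path rather than the node label alone gives an ambiguous label such as "Python" the context of its ancestors, which is one plausible way to reduce noise in the retrieved pages.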
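[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree conferred]: 2014
[Classification number]: TP391.1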