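Chinese Text Classification Based on Multi-Instance Learning
Published: 2018-03-16 23:15
Topic: text classification  Focus: multi-instance learning  Source: Nanjing University, 2012 master's thesis  Type: degree thesis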
[Abstract]: With the rapid development of information technology, the Internet has entered an era of information explosion, and the volume of information is growing exponentially. Users want to find the information they care about quickly and accurately within this mass of data. Driven by that demand, automatic information processing has become a research hotspot, and related technologies such as search engines, text classification, and information filtering are widely used. Natural language text is the main form that this mass of Internet information takes, so automatic text processing has become a core topic in large-scale data processing research. This thesis studies the automatic classification of Chinese text. Chinese text has no natural word boundaries, and errors in automatic word segmentation can substantially reduce classification accuracy. To address this problem, the thesis proposes a Chinese text classification method that is based on multi-instance learning and requires no word segmentation. The method builds a multi-instance feature representation of a document by extracting each Chinese character together with a fixed number of the characters that follow it, and then classifies the text with a random-forest multi-instance classification method and a multi-instance transformation classification method. Experiments on a corpus collected from BBS forums and on the tc-corpus-train corpus show that multi-instance learning achieves relatively high accuracy on automatic Chinese text classification while avoiding word segmentation, which gives the approach practical value.
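The feature-construction step described in the abstract (each Chinese character taken together with a fixed number of following characters) can be sketched as below. This is a minimal illustration and not the thesis's implementation: the function name `build_bag`, the parameter `n_following`, the use of the CJK Unified Ideographs range U+4E00 to U+9FFF to detect Chinese characters, and the choice not to let an instance span punctuation or other non-Chinese text are all assumptions. The abstract does not say how instances are encoded numerically or classified, so the sketch stops at producing the bag of string instances.

```python
import re

# Assumption: "Chinese character" means a code point in the basic
# CJK Unified Ideographs block (U+4E00 to U+9FFF).
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def build_bag(text, n_following=2):
    """Turn one document into a bag of character-window instances.

    Each instance is a Chinese character plus the `n_following`
    Chinese characters that come right after it. Windows are taken
    inside each unbroken run of Chinese characters, so no instance
    crosses punctuation, digits, or Latin text (an assumption; the
    abstract does not specify this).
    """
    if n_following < 0:
        raise ValueError("n_following must be non-negative")
    width = n_following + 1
    bag = []
    for run in CHINESE_RUN.findall(text):
        # Runs shorter than the window contribute no instances.
        for i in range(len(run) - width + 1):
            bag.append(run[i:i + width])
    return bag


if __name__ == "__main__":
    print(build_bag("文本分类,很有用!", n_following=1))
```

With `n_following=1`, the sample sentence yields the instances 文本, 本分, 分类, 很有, and 有用. No word segmenter is involved: the document is labeled as a whole, and its bag of instances is what a multi-instance classifier would consume.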
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1
Link to this article: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1622108.html