Research on Chinese Text Classification Based on Hybrid Features

Published: 2018-01-20 20:45

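  Keywords: text classification; feature weighting algorithm; hybrid features; support vector machine. Source: Northeastern University (东北大学), 2012 master's thesis. Document type: degree thesis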


[Abstract]: With the rapid development of information technology and the arrival of the self-media era on the Internet, more and more information exists online as electronic text. Extracting accurate, valuable knowledge from massive volumes of web page text has become a major goal of information processing. Automatic text classification, a research hotspot in information processing, organizes and processes documents automatically by category and goes a long way toward resolving the disorder of information resources. As a technical foundation for information retrieval, information filtering, and search engines, it has broad application prospects.

This thesis takes topic-oriented retrieval of web page text in vertical search as its application background and accurate topic classification of web page text as its main task. Because vertical search places higher demands on how directly the classified result set leads to the desired content, the thesis designs and implements a Chinese text classification system based on hybrid features, which effectively addresses the weak content directness of result sets produced by traditional web page text classification. The main research topics are the mechanism for acquiring structured information from web pages, the method for building the hybrid feature model, and the classifier training strategy.

For acquiring structured information, an automatic web page text extraction method is designed and implemented. By analyzing page structure, it effectively filters out noise such as advertisements, images, and hyperlinks, and extracts the plain text of the page, including the title and body.

For hybrid feature modeling, the text undergoes natural language processing such as Chinese word segmentation, a feature dimensionality reduction algorithm yields the feature term set, and the feature weighting algorithm is improved. Together these steps complete the content feature model, and the improved algorithm's ability to raise classification performance is verified. The thesis also proposes a page feature set consisting of linguistic features and network features of web pages, models the page features through statistical normalization, and thereby obtains the hybrid feature vector space model.

For classifier training, the supervised classification approach from machine learning is adopted and the support vector machine (SVM) algorithm is studied. An SVM with optimized parameters is trained on the hybrid feature model, yielding a topic classifier and a page filter with better recognition performance.

The system cascades the topic classifier and the page filter to form the hybrid-feature Chinese text classification system. It first fetches page resources by their network addresses and algorithmically extracts the specific text information from the fetched pages. It then builds the hybrid feature model and constructs the classification system from that text. Finally, performance tests show that the system achieves high classification accuracy and strong page filtering capability.
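The abstract does not give the improved weighting formula, the concrete page features, or the order of the cascade, so the following Python sketch is only an illustration under stated assumptions. It uses standard smoothed TF-IDF as a stand-in for the thesis's improved weighting algorithm, min-max scaling as one possible form of statistical normalization, two hypothetical page features (text length and link count), and simple rule functions in place of the trained SVM topic classifier and page filter. The function names `tfidf_vectors`, `min_max_normalize`, `hybrid_vectors`, and `cascade` are invented for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocab):
    """Standard smoothed TF-IDF over pre-segmented documents (token lists).

    Stand-in only: the thesis uses an improved weighting algorithm
    whose formula is not given in the abstract.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d) or 1  # avoid division by zero for empty documents
        vec = []
        for t in vocab:
            tf = counts[t] / total
            idf = math.log((n + 1) / (df[t] + 1)) + 1
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors


def min_max_normalize(rows):
    """Scale every page-feature column into [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def hybrid_vectors(docs, vocab, page_features):
    """Concatenate content features and normalized page features."""
    content = tfidf_vectors(docs, vocab)
    page = min_max_normalize(page_features)
    return [c + p for c, p in zip(content, page)]


def cascade(vector, page_filter, topic_classifier):
    """Assumed order: filter the page first, classify only if it passes."""
    if not page_filter(vector):
        return None
    return topic_classifier(vector)


# Toy data: three pre-segmented pages and a tiny feature term set.
docs = [["足球", "比赛", "进球"], ["股票", "市场", "上涨"], ["足球", "转会", "市场"]]
vocab = ["足球", "股票", "市场"]
page_features = [[1200, 5], [800, 20], [300, 60]]  # hypothetical: text length, link count

vectors = hybrid_vectors(docs, vocab, page_features)

# Hypothetical rule-based stand-ins for the trained SVM models.
page_filter = lambda v: v[4] < 0.5  # reject link-heavy pages
topic_classifier = lambda v: "sports" if v[0] > v[1] else "finance"

labels = [cascade(v, page_filter, topic_classifier) for v in vectors]
```

In a real system the two lambdas would be replaced by SVM models trained on labeled hybrid vectors, and the token lists would come from a Chinese word segmenter applied to the extracted page text.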
[Degree-granting institution]: Northeastern University (东北大学)
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1




Article ID: 1449419


Link to this article: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1449419.html

