Research on Web Page Text Classification Technology Based on Support Vector Machines

Published: 2019-06-21 09:18
[Abstract]: With the rapid development of Internet technology, the volume of web page information is growing exponentially. Users want web pages classified quickly so that they can find valuable information efficiently, and web page text classification is a key technology for fast information retrieval. It is already widely used in digital libraries, search engines, news categorization and similar applications, and it has significant research value.

Web page text classification builds on plain-text classification. Text is usually represented with the vector space model, and because text vectors are high-dimensional and very sparse, most classification algorithms suffer from the curse of dimensionality. The support vector machine (SVM) has a solid theoretical foundation, can effectively avoid the curse of dimensionality on high-dimensional data, and generalizes well, which makes it one of the most commonly used methods for text classification. The main work of this thesis is as follows:

1. The research background and significance of web page text classification are introduced, along with the state of text classification research in China and abroad and the current research hot spots in web page text classification. The related key technologies are analyzed in detail: web page text preprocessing, web page text representation, common feature selection methods, evaluation criteria for text classification, and several common text classification techniques. The principles and techniques of support vector machines are also described in depth.

2. An improved term weighting method is proposed. Feature terms inside different HTML tags influence classification differently, and the position at which a term occurs in the body text also carries different semantic significance. The thesis therefore analyzes web page features in detail and proposes a method that weights feature terms according to HTML semantics and term position. Experiments show that processing web page text with the improved method yields relatively good classification results.

3. When trained on large sample sets, support vector machines consume a great deal of time and memory. To address this, the thesis examines the properties of SVMs, observes that the training result depends only on the support vectors, and on that basis proposes a two-stage SVM algorithm based on fuzzy clustering. The algorithm first reduces the initial sample set with fuzzy C-means clustering, keeping as the training set only the center points of the uniform clusters and all samples in the mixed clusters. If this reduced set contains enough samples, a single weighted SVM is trained on it and the algorithm ends. If the reduced set is only a small fraction of the original samples, many useful support vectors may have been discarded, which can greatly reduce classification accuracy. In that case, the approximate optimal hyperplane obtained by the first-stage weighted SVM is used to de-cluster the cluster centers that lie close to it, and the de-clustered samples together with the mixed-cluster samples form the training set for a second-stage standard SVM, which yields the final optimal hyperplane. Experiments show that the method largely preserves the classification accuracy of the standard SVM while speeding up training, and that the improved method has a clear advantage on large sample sets.
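The tag- and position-based weighting idea in item 2 can be illustrated with a minimal sketch. The abstract does not give the thesis's actual weights or formula, so everything here is an assumption made for illustration: the `TAG_WEIGHTS` values, the `EDGE_PARAGRAPH_BOOST` factor, the rule that the first and last body paragraphs count more, and the function name `weighted_term_frequencies`.

```python
from collections import Counter

# Hypothetical tag weights; the abstract does not state the thesis's values.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "b": 1.5, "body": 1.0}
# Hypothetical boost for terms in the first or last paragraph of the body.
EDGE_PARAGRAPH_BOOST = 1.5


def weighted_term_frequencies(segments):
    """Accumulate a weighted frequency for every term on a page.

    `segments` is a list of (tag, paragraph_index, paragraph_count, tokens)
    tuples. The paragraph fields are only used for "body" segments.
    """
    scores = Counter()
    for tag, para_idx, para_count, tokens in segments:
        weight = TAG_WEIGHTS.get(tag, 1.0)
        # Position rule (assumed): opening and closing paragraphs count more.
        if tag == "body" and para_idx in (0, para_count - 1):
            weight *= EDGE_PARAGRAPH_BOOST
        for token in tokens:
            scores[token] += weight
    return dict(scores)


# A tiny, already tokenized page: one title and three body paragraphs.
page = [
    ("title", 0, 1, ["svm", "classification"]),
    ("body", 0, 3, ["svm", "text"]),
    ("body", 1, 3, ["text", "kernel"]),
    ("body", 2, 3, ["classification"]),
]
scores = weighted_term_frequencies(page)
```

The weighted frequencies would then take the place of raw term counts in a vector space representation, for example as the TF part of a TF-IDF weight.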
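The sample reduction step of item 3 can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: a "uniform cluster" is taken to mean a cluster whose samples all share one class label, each sample is assigned to the cluster with its highest membership, and each retained center is weighted by the size of its cluster. The function names, the fuzzifier `m=2.0`, and the toy data are hypothetical. SVM training and the second-stage de-clustering are left out.

```python
import math


def _memberships(points, centers, exponent):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    rows = []
    for p in points:
        dists = [max(math.dist(p, c), 1e-12) for c in centers]
        rows.append([1.0 / sum((dk / dj) ** exponent for dj in dists) for dk in dists])
    return rows


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Plain fuzzy C-means from given initial centers; returns (centers, memberships)."""
    centers = [list(c) for c in centers]
    exponent = 2.0 / (m - 1.0)
    dims = len(points[0])
    for _ in range(iterations):
        u = _memberships(points, centers, exponent)
        for k in range(len(centers)):
            weights = [row[k] ** m for row in u]
            total = sum(weights)
            centers[k] = [
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dims)
            ]
    return centers, _memberships(points, centers, exponent)


def reduce_training_set(points, labels, centers, memberships):
    """Keep one weighted center per uniform cluster and all samples of mixed clusters.

    Returns a list of (vector, label, weight) triples.
    """
    clusters = {}
    for i, row in enumerate(memberships):
        k = max(range(len(row)), key=row.__getitem__)  # hard assignment (assumed)
        clusters.setdefault(k, []).append(i)
    reduced = []
    for k, members in clusters.items():
        member_labels = {labels[i] for i in members}
        if len(member_labels) == 1:
            # Uniform cluster: the center stands in for all of its members.
            reduced.append((centers[k], member_labels.pop(), float(len(members))))
        else:
            # Mixed cluster: likely near the class boundary, so keep every sample.
            reduced.extend((list(points[i]), labels[i], 1.0) for i in members)
    return reduced


# Toy data: two single-class groups and one group that mixes both classes.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (10, 10), (10, 11), (11, 10)]
labels = [-1, -1, -1, -1, 1, 1, 1, 1, 1]
centers, u = fuzzy_c_means(points, [(0, 0), (5, 5), (10, 10)])
reduced = reduce_training_set(points, labels, centers, u)
```

The resulting triples could be passed to any SVM trainer that accepts per-sample weights. Whether the reduced set is large enough to stop after one pass, or a second stage is needed, is a threshold the abstract does not specify.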
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP393.092




Article ID: 2503962


Link to this article: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2503962.html

