Design and Research of an SVM-Based Web Classification Scheme

Published: 2018-06-16 04:51

Topic: web page classification + text classification; Source: Beijing University of Posts and Telecommunications, master's thesis, 2014

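【Abstract】: In recent years the web has rapidly become the world's largest public information source. Locating and filtering the information users need from this vast body of data, quickly and conveniently, has become an urgent problem, and its core is the automatic classification of web pages. Web text classification derives from web classification and is a major component of text mining. Classifying web pages by topic, building a database of classification results, and generating categorized information resources serves two purposes. On the one hand, it supports customized category directories, tiered management of web pages, and recommendation of online content, which improves search efficiency and helps users reach target pages quickly and accurately. On the other hand, it allows personalization based on each user's category interests, filtering out objectionable and irrelevant pages and enforcing web access control according to the user's wishes. The current mainstream techniques are all forms of web text classification, which achieve automatic web classification mainly by designing a suitable web page representation and a text classification algorithm.

Many algorithms exist for automatic web text classification, and the support vector machine (SVM) is among the most popular and best-performing of them. This thesis designs a complete SVM-based web classification scheme, then designs and implements an automatic web page classification system based on that scheme. Experiments on sample data, with the classification results used to test and evaluate the system, verify the feasibility of the scheme and yield an efficient automatic web page classification system. The main goal of the thesis is to propose a complete SVM-based web classification scheme and to design and implement an automatic web page classification system based on it. The system uses a B/S (browser/server) architecture, is developed on the LAMP (linux+apache+mysql+php) web platform, and uses an SVM classifier.

The thesis completes the following work. First, it analyzes and summarizes the background of web page classification technology, the tasks of the project, and the structure of the thesis. Second, it systematically analyzes and studies the key technologies and related theory of automatic web page classification, including data acquisition, data preprocessing, and the SVM classifier. Data preprocessing covers the preprocessing techniques of text classification: web page denoising, word segmentation, feature selection, and feature quantization. For the classification algorithm, the thesis analyzes and studies KNN and SVM, compares their performance, and selects SVM as the classification algorithm of the system. Third, it describes in detail the design of the SVM-based web page classification scheme, including architecture design and detailed design. The architecture design follows the web classification workflow and covers requirements analysis, implementation goals, development environment, and overall design. The detailed design follows a modular approach, dividing the system into a database module, a user interaction module, and a classification module, each of which is then designed in detail. Fourth, it presents an experiment with an SVM-based web text classification system, analyzes the results, and proposes optimizations of system performance.

Next, the thesis presents its innovation. In the text preprocessing stage, to improve accuracy on high-priority categories such as pornography, violence, gambling, and drugs, the text is screened before word segmentation. First, labeled corpora for categories such as pornography, violence, and drugs (that is, URLs whose category is known) are parsed; the titles are extracted and segmented, word frequencies are counted and sorted in descending order, and the top-ranked keywords form a preset keyword table. Then the pages of the training and prediction samples are parsed, and their title keywords are extracted and matched against the preset keyword table. If a match is found, the corresponding category number is assigned; if not, the page content is segmented, features are extracted, and the SVM produces the final classification. Finally, the thesis summarizes the author's main results during the master's program and the main work of the thesis, and outlines future work.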
[Abstract]: In recent years the web has rapidly become the world's largest public information source. Locating and filtering the information users need from this vast body of data, quickly and conveniently, has become an urgent problem, and its core is the automatic classification of web pages.
Classifying web pages by topic supports, on the one hand, customized category directories, tiered page management, and content recommendation, which improve users' search efficiency. On the other hand, it allows personalization based on each user's category interests and web access control according to the user's wishes. The current mainstream techniques are all forms of web text classification, which achieve automatic web classification mainly by designing a suitable web page representation and a text classification algorithm.

Many algorithms exist for automatic web text classification, and the support vector machine (SVM) is among the most popular and best-performing of them. This thesis designs a complete SVM-based web classification scheme, then designs and implements an automatic web page classification system based on that scheme. Experiments on sample data, with the classification results used to test and evaluate the system, verify the feasibility of the scheme and yield an efficient automatic web page classification system.

The main goal of the thesis is to propose a complete SVM-based web classification scheme and to design and implement an automatic web page classification system based on it. The system uses a B/S (browser/server) architecture, is developed on the LAMP (linux + apache + mysql + php) web platform, and uses an SVM classifier.

The thesis completes the following work:

First, it analyzes and summarizes the background of web page classification technology, the tasks of the project, and the structure of the thesis.

Second, it systematically analyzes and studies the key technologies and related theory of automatic web page classification, including data acquisition, data preprocessing, and the SVM classifier. Data preprocessing covers the preprocessing techniques of text classification: web page denoising, word segmentation, feature selection, and feature quantization. For the classification algorithm, the thesis analyzes and studies KNN and SVM, compares their performance, and selects SVM as the classification algorithm of the system.
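
Feature quantization, the last preprocessing step listed above, turns each segmented document into a numeric vector that a classifier such as an SVM can consume. The abstract does not say which weighting the thesis uses; the sketch below assumes TF-IDF, a common choice. The function name `tfidf_vectors` and the toy documents are illustrative only, and the documents are assumed to be already denoised and segmented into tokens.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) with one TF-IDF weight vector per document."""
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        # Weight = relative term frequency * log(inverse document frequency).
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical, already-segmented page texts.
docs = [
    ["football", "match", "score"],
    ["stock", "market", "score"],
    ["football", "league", "match"],
]
vocab, vecs = tfidf_vectors(docs)
```

Under this formula a term that occurs in every document gets weight 0, so it contributes nothing to separating the classes. Feature selection, which the abstract lists as a separate step, would further reduce the vocabulary before training.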

Third, it describes in detail the design of the SVM-based web page classification scheme, including architecture design and detailed design. The architecture design follows the web classification workflow and covers requirements analysis, implementation goals, development environment, and overall design.
The detailed design follows a modular approach, dividing the system into a database module, a user interaction module, and a classification module, each of which is then designed in detail.

Fourth, it presents an experiment with an SVM-based web text classification system, analyzes the results, and proposes optimizations of system performance.

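Next, the thesis presents its innovation. In the text preprocessing stage, to improve accuracy on high-priority categories such as pornography, violence, gambling, and drugs, the text is screened before word segmentation. First, labeled corpora for categories such as pornography, violence, and drugs (that is, URLs whose category is known) are parsed; the titles are extracted and segmented, word frequencies are counted and sorted in descending order, and the top-ranked keywords form a preset keyword table. Then the pages of the training and prediction samples are parsed, and their title keywords are extracted and matched against the preset keyword table. If a match is found, the corresponding category number is assigned; if not, the page content is segmented, features are extracted, and the SVM produces the final classification.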
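
The title-keyword screening that the abstract describes for high-priority categories can be sketched as follows. This is a minimal illustration rather than the thesis's implementation: the function names, the `top_n` cutoff, the toy titles, and the `fallback` stub standing in for the segmentation, feature extraction, and SVM stage are all assumptions, and titles are taken to be already segmented into tokens.

```python
from collections import Counter


def build_keyword_table(labeled_titles, top_n=2):
    """Map each category to its top_n most frequent title tokens."""
    table = {}
    for category, titles in labeled_titles.items():
        # Count word frequency over all titles labeled with this category.
        counts = Counter(token for title in titles for token in title)
        table[category] = {token for token, _ in counts.most_common(top_n)}
    return table


def classify(title_tokens, table, fallback):
    """Return the category on a keyword hit, else defer to the fallback."""
    tokens = set(title_tokens)
    for category, keywords in table.items():
        if keywords & tokens:
            return category
    # No preset keyword matched: hand over to the full classifier.
    return fallback(title_tokens)


# Hypothetical labeled titles, already segmented.
labeled_titles = {
    "gambling": [["casino", "bonus"], ["casino", "poker"], ["poker", "online"]],
    "drugs": [["buy", "pills"], ["pills", "cheap"]],
}
table = build_keyword_table(labeled_titles)

# Stub for the segmentation + feature extraction + SVM stage.
svm_stage = lambda tokens: "other"

hit = classify(["casino", "night"], table, svm_stage)
miss = classify(["weather", "today"], table, svm_stage)
```

Checking the title first means a page whose title already contains a preset keyword never reaches the more expensive segmentation and SVM stage, which is the ordering the abstract describes.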

Finally, the thesis summarizes the author's main results during the master's program and the main work of the thesis, and outlines future work.
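【Degree-granting institution】: Beijing University of Posts and Telecommunications
【Degree level】: Master's
【Year conferred】: 2014
【Classification number】: TP393.092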
