Research on Educational Resource Classification Based on Rules and SVM

Published: 2018-10-10 15:48

[Abstract]: With the rapid development of network technology and the rapid growth of online information, a large volume of educational resources has appeared on the web. Online educational resources are becoming an increasingly important source of information for students, education researchers, and parents. However, existing search engines often return a large amount of irrelevant or useless content. How to obtain useful resources quickly and effectively, and how to classify educational resources drawn from a large body of information, is therefore the focus of this thesis; automatic text classification is one of the key technologies for classifying online educational resources automatically. The main contents of the thesis are as follows:
1. The current state of online educational resources and the behavior and needs of network users are analyzed, and a classification system for basic education resources is constructed.
2. Many feature selection algorithms exist, so a dependable criterion is needed for deciding which algorithm to use in a given situation. The thesis surveys basic feature selection algorithms from the literature, compares the methods and algorithms empirically, and then proposes such a criterion.
3. Educational resources stand in subordinate and parallel relationships to one another, and the thesis uses these relationships to organize them into a hierarchy. It examines how the main structural features of HTML web pages (title, Anchor Text, meta) affect web page classification and proposes a rule-based classification method. Experimental results show that features such as the title and anchor text have a positive effect on web page classification.
4. To build a classifier for educational resources, the thesis first introduces the basic theory of SVM. Building on the traditional SVM algorithm, and addressing the sensitivity of classification results to outliers in non-linearly-separable text problems, it proposes an improved multi-class SVM algorithm (Weighted Multi-Class SVM). Experimental results show that this algorithm classifies better than the multi-class SVM algorithm.
5. The rule-based classification algorithm has high precision but low recall, whereas the improved SVM algorithm has low precision but high recall. To address this, the thesis proposes combining the two methods. Experimental results show that both the classification performance and the efficiency of the system improve.
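The abstract does not say how the rule-based classifier and the SVM are combined, so the following is only a minimal sketch of one plausible arrangement: apply the high-precision rules to the page's title, anchor text, and meta fields first, and hand the page to a statistical classifier only when no rule fires unambiguously. The category names, the keyword table `RULES`, and the `fallback` callable (a stand-in for a trained SVM) are all hypothetical and are not taken from the thesis.

```python
from html.parser import HTMLParser

# Hypothetical keyword rules; the thesis does not list its actual rules.
RULES = {
    "mathematics": ("algebra", "geometry"),
    "physics": ("mechanics", "optics"),
}


class StructuralFeatureParser(HTMLParser):
    """Collects title text, anchor text, and meta keywords/description."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "anchor": [], "meta": []}
        self._current = None  # "title" or "a" while inside that element

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "a"):
            self._current = tag
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() in ("keywords", "description"):
                self.fields["meta"].append(attr.get("content") or "")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.fields["title"].append(data)
        elif self._current == "a":
            self.fields["anchor"].append(data)


def extract_fields(html):
    """Return lower-cased title, anchor, and meta text for one page."""
    parser = StructuralFeatureParser()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts).lower() for name, parts in parser.fields.items()}


def rule_classify(fields):
    """Return a category only if exactly one category's keywords match."""
    text = " ".join(fields.values())
    hits = [cat for cat, words in RULES.items() if any(w in text for w in words)]
    return hits[0] if len(hits) == 1 else None


def hybrid_classify(html, fallback):
    """Rules first; defer to `fallback` (e.g. a trained SVM) otherwise."""
    label = rule_classify(extract_fields(html))
    if label is not None:
        return label, "rule"
    return fallback(html), "fallback"
```

In this arrangement the rules decide the cases they are confident about, and the fallback classifier covers the pages the rules miss, which is one way to trade the rules' low recall against the SVM's lower precision. Pages matching keywords from more than one category are also deferred to the fallback rather than guessed.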
[Degree-granting institution]: Xinjiang University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.1

[References]