Research and Implementation of a Web Text Classification Algorithm Based on Web Page Block Partitioning

Published: 2018-08-13 13:05

[Abstract]: The Internet has become an important channel through which people obtain information. As the volume of Web information grows, extracting useful information from such a large amount of data has become an important problem. To organize and analyze massive Web text resources effectively, data mining techniques for Web text are becoming increasingly important, and Web text classification is a key research topic within Web text mining. Because Web text contains noise and is semi-structured, classification techniques for Web text differ from those for traditional plain text. Machine-learning-based text classification consists of three parts: text representation, classification method, and evaluation. The vector space model is the most commonly used document representation, and feature selection and feature dimensionality reduction are the two main factors that affect it. Machine learning methods such as those based on Bayes' theorem and support vector machines are often used to construct text classifiers. Most template-based commercial web pages contain content blocks related to the page topic alongside noise such as advertisements, navigation bars, and copyright notices. This noise degrades web-page-based information processing tasks such as information retrieval and web page classification. This thesis partitions web pages into blocks using certain HTML tags that heuristically indicate block boundaries, decides whether each block is noise by computing how frequently it appears across the whole page collection, and on this basis presents a web page partitioning algorithm, ContentDiscoverer. Experiments show that, compared with similar algorithms, ContentDiscoverer runs faster and identifies topic content blocks more accurately. ContentDiscoverer is then applied to web page classification, and a Chinese web page classifier is designed and implemented. Experimental results show that classification accuracy improves considerably after web page block partitioning.
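The frequency-based noise test described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's ContentDiscoverer implementation, which the abstract does not specify. The tag set `BLOCK_TAGS`, the helper names, the exact-text matching of blocks, and the `max_page_fraction` threshold are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: tags treated as block boundaries. The abstract does not list
# the tags that ContentDiscoverer actually uses.
BLOCK_TAGS = {"table", "tr", "td", "div", "p", "ul", "hr"}


class BlockSplitter(HTMLParser):
    """Collects the text found between block-boundary tags as separate blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def flush(self):
        # Join buffered text pieces and normalize whitespace.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def split_blocks(html):
    """Split one HTML page into a list of text blocks."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text after the last boundary tag
    return parser.blocks


def content_blocks(pages, max_page_fraction=0.5):
    """For each page, keep the blocks that occur in at most
    max_page_fraction of all pages; more frequent blocks count as noise."""
    per_page = [split_blocks(page) for page in pages]
    page_freq = Counter()
    for blocks in per_page:
        page_freq.update(set(blocks))  # count each block once per page
    limit = max_page_fraction * len(pages)
    return [[b for b in blocks if page_freq[b] <= limit] for blocks in per_page]
```

Calling `content_blocks` on a list of HTML strings from the same site returns, for each page, the blocks that are not shared by most pages. Exact text matching is the simplest possible block comparison; a real system would likely need a more tolerant similarity measure, since template blocks often differ slightly between pages.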
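[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2007
[Classification number]: TP391.1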
