Research and Application of Tag Identification Technologies for Web Sites

Published: 2018-12-18 03:11

[Abstract]: In recent years, with the explosive growth in the number of Internet sites, the information on the Internet has come to overload its users, and finding sites of a particular type in this vast space has become a major challenge. Classifying Internet sites effectively, with each site treated as a whole, is therefore particularly important. Existing research on website classification is based on single-label classification, either binary or multi-class. In view of this, and of the fact that a website typically covers multiple topics, this thesis proposes a system for multi-label identification of websites: a system that automatically identifies the multiple topics of an existing website, taking the site as the unit. The introduction briefly presents the background and significance of website tag identification, the state of research on website identification, and the main research content of this thesis. The thesis then introduces web crawler technology, web page information extraction and text classification algorithms, and multi-label algorithms together with their evaluation metrics. Next, it discusses website multi-labeling from three aspects, focusing on the following questions: first, how to analyze the structure of a website and extract structural information; second, how to locate the content-bearing information in a web page and extract the main text; third, how to label a website based on the structural information and the main-text information. The work is divided into the following parts.
1. Backtracking of website topology and extraction of structural features. A website has two kinds of structure: the physical structure, determined by where files are stored on the server, and the link structure of the site. Neither of these reflects the hierarchical relationships of the website clearly. This thesis therefore proposes a website topology backtracking method to recover the hierarchical relationships of a site. Experiments show that the algorithm performs well in backtracking the hierarchical structure of websites.
2. Locating and extracting the main text of web pages. Most of the information on a website comes from the main text of its pages, so it is necessary to separate page content into main text and noise. The improved DSE algorithm proposed in this thesis combines the DSE algorithm with statistical rules on the characters and punctuation marks of main-text content to extract the main text. A comparison with the original DSE algorithm shows that the improved DSE algorithm gives satisfactory extraction results.
3. The website tag identification system. To handle classes whose numbers of feature samples are unevenly distributed, this thesis proposes an attribute-weighting approach that applies feature-sample weighting to the ML-KNN algorithm, so that classes with many feature samples receive a low weight and classes with few feature samples receive a high weight. This mitigates the low classification accuracy caused by sample imbalance between classes. Experiments show that the attribute-weighted algorithm, S-ML-KNN, does improve classification accuracy.
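The attribute-weighting idea in part 3 (low weight for labels with many samples, high weight for labels with few) can be illustrated with a minimal sketch. This is not the thesis's S-ML-KNN algorithm, whose details the abstract does not give. The inverse-frequency formula, the function names, and the toy label data below are all assumptions chosen only for illustration.

```python
from collections import Counter


def inverse_frequency_label_weights(label_sets):
    """Return one weight per label; rarer labels get larger weights.

    label_sets: list of sets, the labels attached to each training site.
    Assumed formula (not from the thesis):
        weight(label) = n_samples / (n_labels * count(label))
    """
    counts = Counter(label for labels in label_sets for label in labels)
    n_samples = len(label_sets)
    n_labels = len(counts)
    return {label: n_samples / (n_labels * c) for label, c in counts.items()}


def weighted_neighbor_votes(neighbor_label_sets, weights):
    """Sum weighted per-label votes over the label sets of the k nearest neighbors."""
    votes = Counter()
    for labels in neighbor_label_sets:
        for label in labels:
            # Labels unseen during weighting fall back to a neutral weight of 1.0.
            votes[label] += weights.get(label, 1.0)
    return dict(votes)


# Hypothetical toy data: "news" is frequent, "sports" and "shopping" are rare.
training = [{"news"}, {"news", "sports"}, {"news"}, {"shopping"}]
weights = inverse_frequency_label_weights(training)

# Label sets of two hypothetical nearest neighbors of a site to be labeled.
votes = weighted_neighbor_votes([{"news"}, {"news", "sports"}], weights)
```

In this toy example the frequent label "news" receives a smaller weight than the rare label "sports", so a single neighbor carrying "sports" counts for more than a single neighbor carrying "news". How such weights would enter the ML-KNN posterior estimates is left open here, since the abstract does not specify it.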
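[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092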
