A Study of Web Page Classification Methods Based on the Wikipedia Category Network and URL Pattern Trees

Published: 2018-05-09 14:38

Topics: web page classification + Wikipedia network; Source: Shanghai Jiao Tong University, 2013 master's thesis

[Abstract]: Classification is a central problem in information retrieval, and web page classification is particularly important for improving the quality of Internet services. Many key Internet applications, including site directories, search engines, web crawlers, recommender systems, user behavior analysis systems, and ad delivery systems, depend on efficient and accurate page classification. Many methods have been proposed for the classification problems these applications raise, among them text classification based on page content. Content-based methods depend on the quality of the body text: when the text is of poor quality or too short, classification performance degrades. With the construction of large-scale dictionaries and category systems, classification based on third-party lexicons has attracted wide attention. A third-party lexicon supplies ready-made semantic categories. These can serve as auxiliary information that strengthens semantic recognition and improves classification accuracy, or they can be used for classification directly. Direct use mitigates, to some extent, the weakness of content-based methods on short text, and it classifies efficiently without relying on a training set.
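As a rough illustration of using a lexicon directly for classification without a training set, the following minimal sketch looks up each token in a term-to-category table, rolls each category up a hierarchy to a top-level topic, and takes a majority vote. The tables `TERM_CATEGORIES`, `CATEGORY_PARENTS`, and `TOP_LEVEL`, their contents, and the voting rule are all hypothetical toy assumptions, not the lexicon or scoring used in the thesis.

```python
from collections import Counter

# Hypothetical toy lexicon: term -> fine-grained categories (illustrative data only).
TERM_CATEGORIES = {
    "python": ["Programming languages"],
    "compiler": ["Programming tools"],
    "striker": ["Football positions"],
    "goalkeeper": ["Football positions"],
}

# Hypothetical category hierarchy: category -> parent categories.
CATEGORY_PARENTS = {
    "Programming languages": ["Computing"],
    "Programming tools": ["Computing"],
    "Football positions": ["Sports"],
}

# Hypothetical set of top-level topics that serve as the output labels.
TOP_LEVEL = {"Computing", "Sports"}


def top_level_topics(category, seen=None):
    """Walk parent links upward and collect the top-level topics reached."""
    seen = set() if seen is None else seen
    if category in seen:  # guard against cycles in the hierarchy
        return set()
    seen.add(category)
    if category in TOP_LEVEL:
        return {category}
    topics = set()
    for parent in CATEGORY_PARENTS.get(category, []):
        topics |= top_level_topics(parent, seen)
    return topics


def classify(text):
    """Return the top-level topic with the most votes, or None if no token matches."""
    votes = Counter()
    for token in text.lower().split():
        for category in TERM_CATEGORIES.get(token, []):
            for topic in top_level_topics(category):
                votes[topic] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Because every vote comes from a lexicon hit, even a text of a few words can be labeled as long as one of its terms is covered, which is the short-text advantage described above. A text with no covered term yields `None` rather than a guess.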
This thesis targets classification across the whole web, where data is structurally complex, noisy, and subject to heavy interference. Traditional methods face two problems in this setting: poor text quality sharply reduces classification accuracy, and the sheer volume of web-wide data means a large training set is needed to train the model, which may make efficient classification impractical. This thesis proposes a topic classification model based on the Wikipedia network. The Wikipedia category network is rich in both vocabulary and semantics, and because Wikipedia is edited online in real time, much of its vocabulary keeps pace with current usage, so it covers web-wide vocabulary well. The method also needs no training set: once the category associations in the Wikipedia network are established, it can be used for classification prediction. And although the vocabulary of Wikipedia categories changes in real time, the category system as a whole is relatively stable, so the method remains effective over a long period. Our experiments compare it with a traditional content-based classification method to demonstrate its feasibility. In addition, this thesis proposes a novel site function classification method based on URL pattern trees. Borrowing the syntax tree kernel (Tree Kernel) from natural language processing, it constructs URL grammar rules and URL syntax trees and classifies site function with an improved Tree Kernel.
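The abstract does not give the URL grammar or the improved kernel, so the following is only a sketch of the general idea under stated assumptions. It parses a URL into a small tree with a hypothetical grammar (`URL -> PATH QUERY`, path segments abstracted to `NUM`, `WORD`, or `MIXED`, query keys kept verbatim) and compares two trees with the standard Collins-Duffy subtree kernel, not the thesis's improved variant. All function names and the `decay` parameter are illustrative.

```python
import re
from urllib.parse import urlsplit, parse_qsl


def segment_type(segment):
    """Map a raw path segment to a coarse, hypothetical pattern label."""
    if segment.isdigit():
        return "NUM"
    if re.fullmatch(r"[A-Za-z]+", segment):
        return "WORD"
    return "MIXED"


def url_tree(url):
    """Build a nested (label, children) tree from a URL (assumed toy grammar)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    path = ("PATH", tuple(("SEG", ((segment_type(s), ()),)) for s in segments))
    keys = [k for k, _ in parse_qsl(parts.query, keep_blank_values=True)]
    query = ("QUERY", tuple(("KEY", ((k, ()),)) for k in keys))
    return ("URL", (path, query))


def nodes(tree):
    """Yield every node of the tree in pre-order."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)


def production(tree):
    """A node's grammar production: its label plus the labels of its children."""
    return (tree[0], tuple(child[0] for child in tree[1]))


def common_subtrees(a, b, decay):
    """Collins-Duffy recursion: decayed count of shared subtrees rooted at a and b."""
    if not a[1] or not b[1] or production(a) != production(b):
        return 0.0
    score = decay
    for child_a, child_b in zip(a[1], b[1]):
        score *= 1.0 + common_subtrees(child_a, child_b, decay)
    return score


def tree_kernel(t1, t2, decay=0.5):
    """Sum the shared-subtree scores over all pairs of nodes."""
    return sum(common_subtrees(a, b, decay) for a in nodes(t1) for b in nodes(t2))


def similarity(url1, url2, decay=0.5):
    """Normalized kernel in (0, 1]; identical tree shapes score 1."""
    t1, t2 = url_tree(url1), url_tree(url2)
    norm = (tree_kernel(t1, t1, decay) * tree_kernel(t2, t2, decay)) ** 0.5
    return tree_kernel(t1, t2, decay) / norm
```

Under this toy grammar, two article-style URLs such as `http://example.com/news/2013/12345` and `http://example.org/news/2012/67890` produce identical trees, while a search-style URL with query keys scores much lower against either. A kernel of this kind could be plugged into a kernel-based classifier to label site function, though which classifier the thesis uses is not stated in the abstract.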
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP391.3
