Research on the Detection of Hidden Web Spam Pages

Published: 2018-07-28 07:59

[Abstract]: Web spam refers to the practice in which page creators use techniques that confuse or deceive search engines so that a page ranks higher in the search results than it deserves. Such pages not only reduce the accuracy and efficiency of search engine retrieval but also seriously degrade the user's search experience, and they are widely recognized as one of the greatest challenges facing web retrieval. Among web spam techniques, hiding techniques are covert, deceptive, and hard to detect, and they have become a pressing problem in web spam detection.
This thesis surveys the current state of research, in China and abroad, on detecting hidden web spam, and introduces the types and characteristics of hiding techniques. It summarizes the observed forms of cloaking spam pages and describes in detail how cloaking is implemented, along with existing detection techniques for hidden web spam.
Based on the seven cloaking phenomena it identifies, the thesis proposes a type-based Cloaking detection algorithm and designs a detection system framework for cloaking spam pages. The framework comprises four modules: dataset acquisition, page feature extraction, Cloaking detection, and file management. For the dataset acquisition module, the thesis describes in detail how search results are fetched while simulating both a search engine crawler and a user browser. For the feature extraction module, it analyzes in detail the effectiveness of specific tags and of content and link features. The Cloaking detection module implements the proposed algorithm, uses naive Bayes to classify complex Cloaking cases, and compares the experimental results with those of several common classification algorithms. The file management module manages the system's files.
The thesis also builds a Chinese spam vocabulary and a Chinese sample dataset of cloaking spam pages, validates the cloaking detection algorithm experimentally, and analyzes the experimental results in detail.
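
The crawler-versus-browser comparison that underlies cloaking detection can be illustrated with a minimal sketch. This is not the thesis's type-based algorithm, which the abstract does not specify. It only assumes that a page is fetched twice with different User-Agent headers and that the two copies are compared on term and link sets. The names `fetch`, `PageFeatures`, `extract_features`, `jaccard_distance`, and `looks_cloaked`, the User-Agent strings, and the 0.5 threshold are all hypothetical choices made for this example.

```python
import re
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Hypothetical User-Agent strings; the abstract does not say which were used.
CRAWLER_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"


def fetch(url, user_agent, timeout=10):
    """Download a page while presenting the given User-Agent header."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class PageFeatures(HTMLParser):
    """Collects visible terms and link targets from an HTML document."""

    def __init__(self):
        super().__init__()
        self.terms = set()
        self.links = set()
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Simplification: \w+ splits on whitespace and punctuation only.
        # Chinese text would need a word segmenter instead.
        if self._skip_depth == 0:
            self.terms.update(re.findall(r"\w+", data.lower()))


def extract_features(html):
    """Return (terms, links) found in the HTML string."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    return parser.terms, parser.links


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def looks_cloaked(crawler_html, browser_html, threshold=0.5):
    """Flag a page whose crawler and browser copies differ strongly."""
    crawler_terms, crawler_links = extract_features(crawler_html)
    browser_terms, browser_links = extract_features(browser_html)
    term_diff = jaccard_distance(crawler_terms, browser_terms)
    link_diff = jaccard_distance(crawler_links, browser_links)
    return max(term_diff, link_diff) > threshold
```

In use, one would call `looks_cloaked(fetch(url, CRAWLER_UA), fetch(url, BROWSER_UA))`. Legitimate dynamic pages also change between requests, so a single threshold like this would produce false positives. A difference score of this kind is better treated as one input feature for a classifier such as the naive Bayes model mentioned above than as a verdict on its own.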
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP393.092
