Research and Implementation of Web Page Classification and Web Page Purification in Search Engines

Published: 2018-07-06 13:01

Topic: web page classification + topic-type web pages; Source: master's thesis, Wuhan University of Technology, 2013

[Abstract]: As society and technology advance, daily life has become inseparable from the web, and the volume of web pages has grown enormously to meet demand. Finding the information a user needs within this mass has become increasingly difficult, and search engines exist to solve this problem. By presentation form, the pages users browse fall into three types: directory pages (hub), topic pages (topic), and picture pages (picture), with video pages now also counted as picture pages. These differences in presentation determine how information should be extracted: for directory pages, mainly the link information in the middle of the page; for topic pages, the main content; for picture pages, mainly the images and videos. Classifying pages quickly and accurately is therefore a task the search engine must complete during preprocessing. The boundaries between page types are blurring, however: many directory pages now contain large amounts of descriptive text, which makes them resemble topic pages and makes classification considerably harder. The main goal of the preprocessing stage is information extraction, and because web pages are semi-structured data, extraction poses many challenges. For the sake of rich content, attractive layout, and commercial considerations, pages generally carry irrelevant links, advertisements, copyright notices, and similar material. This noise seriously reduces the accuracy of content extraction and, in turn, the accuracy of the results returned to users, so it must be removed during extraction. Improving search quality and efficiency is a long-standing research goal, and within search engine preprocessing this thesis concentrates on two points, web page classification and web page purification. The main contributions are: (1) A web page classification method is proposed and implemented. It distinguishes directory pages from topic pages using a set of multi-feature heuristic rules, and experiments show that it classifies pages well. (2) Adopting the idea of page segmentation into blocks, and drawing on observed statistics of the main content of pages, the thesis proposes deciding whether a structural block is a topic block by computing that block's support rate for the type of the page as a whole. In addition, for non-standard pages whose main content is scattered, text similarity comparison is used to decide whether a block is a topic block. Experiments show that the algorithm is effective.
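The abstract names neither the features behind the heuristic rules nor the similarity measure, so the following Python sketch is only an illustration of the two ideas, not the thesis's algorithm. It assumes a single hypothetical feature set (number of links and the share of visible text that sits inside links) with made-up thresholds `min_links` and `link_ratio`, and it swaps in character-bigram cosine similarity as a stand-in for the unspecified text similarity comparison.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class LinkTextStats(HTMLParser):
    """Counts visible characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.link_count = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
            self.link_count += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return  # ignore script and style bodies
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth:
            self.link_chars += n


def classify_page(html, min_links=5, link_ratio=0.5):
    """Return "hub" or "topic" (both thresholds are illustrative assumptions)."""
    stats = LinkTextStats()
    stats.feed(html)
    stats.close()
    if stats.total_chars == 0:
        return "hub" if stats.link_count >= min_links else "topic"
    ratio = stats.link_chars / stats.total_chars
    if stats.link_count >= min_links and ratio >= link_ratio:
        return "hub"
    return "topic"


def bigram_cosine(a, b):
    """Cosine similarity of character-bigram counts (assumed stand-in measure)."""
    va = Counter(a[i:i + 2] for i in range(len(a) - 1))
    vb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

In a pipeline along the lines the abstract describes, `classify_page` would run first, and for a page judged to be a topic page each candidate block's text could be compared with an already accepted topic block via `bigram_cosine`, keeping blocks whose score exceeds a chosen threshold. Character bigrams are used here only because they need no word segmentation for Chinese text.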
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3
