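Research on Web Page Noise Removal and Classification Algorithms
Published: 2018-06-10 20:25
Topics: web page classification + web page noise; Source: Huaqiao University, master's thesis, 2008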
[Abstract]: With the rapid development of the Internet, the amount of information on the network is growing sharply. The Internet offers people a wealth of information, but it also makes obtaining information quickly and accurately a challenge. To make effective use of web resources, web pages need to be classified. This thesis studies the key technologies of web page classification and examines web page noise removal and classification algorithms in depth.
In web page preprocessing, the most critical problem is removing noise data from the page: advertisements, navigation bars, copyright notices, and other information unrelated to the page content should be removed as far as possible, so as to obtain the topical content of the page. Based on an analysis of existing methods and of how web pages are built, and taking into account both page structure and block size, we design and implement a block-analysis-based web page noise removal algorithm that adjusts its threshold automatically.
The feature aggregation algorithm takes the relationships between words into account: it aggregates feature words into distribution patterns according to their contribution to classification, and uses these distribution patterns in place of the traditional scheme in which each word corresponds to one dimension of the vector. We tested the feature aggregation algorithm in the classification system of this thesis; the results show that it handles the data set skew problem well and improves the overall performance of the classifier.
Many classification algorithms have been proposed in the field of text classification, among which KNN and SVM are considered two of the more effective. We propose the SVM-KNN algorithm, which combines the KNN and SVM classifiers and improves classifier performance through feedback and correction of the classification prediction probabilities.
Finally, in the experimental Chinese web page classification system that we implemented, we tested the practical effect of the block-based web page noise removal algorithm and the SVM-KNN algorithm. The experimental results demonstrate the effectiveness of the algorithms.
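The abstract does not give the details of the block-based noise removal algorithm, so the following is only a minimal sketch of the general idea under stated assumptions, not the thesis's method. It assumes that a page is split into flat text blocks at block-level tag boundaries, that a block is treated as noise when it is short or consists mostly of link text, and that the size threshold is "adjusted automatically" by deriving it from the page's own mean block length. The names `BlockSplitter` and `denoise` and the parameters `size_factor` and `max_link_ratio` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags mark block boundaries; the thesis may use another set.
BLOCK_TAGS = {"div", "p", "td", "tr", "table", "li", "ul", "ol", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Split a page into flat text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (text, link_chars) tuples
        self._text = []        # text pieces of the block being built
        self._link_chars = 0   # characters of the current block inside <a>
        self._in_link = 0
        self._skip = 0

    def flush(self):
        """Close the current block and store it if it contains any text."""
        text = "".join(self._text).strip()
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style contents
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def denoise(html, size_factor=0.5, max_link_ratio=0.5):
    """Return the text blocks judged to be content rather than noise.

    Assumption: the size threshold is size_factor times the mean block
    length of this page, so it adapts to each page automatically.
    """
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()  # store any trailing text after the last tag
    blocks = splitter.blocks
    if not blocks:
        return []
    mean_len = sum(len(text) for text, _ in blocks) / len(blocks)
    threshold = size_factor * mean_len
    return [
        text
        for text, link_chars in blocks
        if len(text) >= threshold and link_chars / len(text) <= max_link_ratio
    ]
```

Calling `denoise(html_string)` on a page with a navigation bar, a long article paragraph, and a short copyright line would be expected to keep only the paragraph, because the other two blocks fall below the page-relative size threshold. Deriving the threshold from the page itself, rather than fixing a character count, is one simple way to avoid tuning it per site; the thesis's actual adjustment rule is not stated in the abstract.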
[Degree-granting institution]: Huaqiao University
[Degree level]: Master's
[Year degree granted]: 2008
[Classification number]: TP393.092
Article ID: 2004485
Article link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/2004485.html