Research and Implementation of Web Advertisement Image Filtering Technology

Published: 2018-04-20 00:32

Topic: advertisement image filtering + SVM; Source: 2017 master's thesis, Beijing University of Posts and Telecommunications (北京邮电大学)

[Abstract]: Since the Internet entered China in the 1990s, the country's Internet penetration rate has reached 51.2% and the number of Internet users has reached 710 million, with more and more people publishing or obtaining information online. A user base this large naturally represents a huge commercial opportunity. Web pages are filled with ever more advertisements, which seriously hampers the public's access to useful information. Moreover, since the start of the Web 2.0 era, images have been used more and more to spread advertising, because they offer better visual impact and can carry richer content in a more concise form; this seriously affects people's work efficiency. There is already a good deal of research on filtering advertisement images, but most of it classifies images by analyzing their actual content. Accuracy is high, but image recognition is difficult and the algorithms are complex. In view of this, the thesis studies how to filter advertisement images on Web pages efficiently and conveniently. The work is as follows:
1. The features of advertisement images are summarized, and the strengths and weaknesses of current approaches to image feature selection are analyzed. Because current Web advertising favors personalization driven by user interest, features of Web advertisement images are extracted from four aspects: interest, text, link, and attribute. Combined with the SVM machine learning algorithm, an advertisement image filtering model based on DOM attributes is proposed.
2. The DOM attributes of HTML text are examined in depth. Drawing on the features of advertisement images and the current state of interest-based advertisement recommendation, the thesis studies advertisement image filtering based on DOM attributes, which avoids recognizing image content, and proposes a method that extracts 11 features across the four aspects of interest, text, link, and attribute. Simulation experiments verify the effectiveness of the model in terms of accuracy, precision, recall, and F1 measure.
3. For text feature extraction, the commonly used keyword matching algorithms are studied and their strengths and weaknesses compared. Because the content to be matched in this work is fairly well defined, the forward maximum matching algorithm is chosen for keyword filtering.
4. HTTP transparent proxy technology and content filtering technology are studied, and a DOM-attribute-based advertisement image filtering system is built on the Squid-ICAP architecture. The design of the system and the design and implementation of its key functional modules are described in detail, and the filtering effect of the system is verified.
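The forward maximum matching step named in item 3 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `forward_max_match` and the keyword set are hypothetical, chosen only for illustration. The idea is to scan the text from left to right and, at each position, try the longest candidate first and shrink it until it matches a dictionary keyword.

```python
def forward_max_match(text, keywords):
    """Scan text left to right, preferring the longest keyword at each position.

    `keywords` is a set of strings. The name and signature are illustrative
    assumptions, not taken from the thesis.
    """
    if not keywords:
        return []
    max_len = max(len(k) for k in keywords)
    matches = []
    i = 0
    while i < len(text):
        matched = None
        # Try the longest window first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in keywords:
                matched = candidate
                break
        if matched:
            matches.append(matched)
            i += len(matched)  # skip past the matched keyword
        else:
            i += 1  # no keyword starts here; advance one character
    return matches


# Hypothetical ad-related keywords, for illustration only.
AD_KEYWORDS = {"广告", "广告联盟", "促销"}
print(forward_max_match("广告联盟促销", AD_KEYWORDS))
```

Because the longest candidate is tried first, "广告联盟" is reported as one keyword rather than the shorter "广告". In a filter of this kind, the list of matched keywords (or its length) could serve as one text feature for the classifier, though the thesis abstract does not specify how the match result is used.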
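[Degree-granting institution]: Beijing University of Posts and Telecommunications (北京邮电大学)
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP393.09;TP391.41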

[References]