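Research and Implementation of Key Technologies for Intelligent Identification of Illegal Online Advertisements in Industry and Commerce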

Published: 2019-03-17 15:25

[Abstract]: With advances in technology and society, online business and online shopping are increasingly favored by advertisers and consumers, and Internet advertising plays an irreplaceable role in the economy and society. Along with great convenience, however, it has brought many problems: false claims, exaggerated curative effects, guarantees of a cure, and other practices that mislead and deceive consumers. Effective supervision and regulation of Internet advertising is therefore very important. This thesis targets the industry and commerce regulatory domain and studies and implements key techniques for the intelligent identification of illegal online text advertisements. Because different categories of illegal advertisement are handled differently, an improved text classification algorithm is first used to classify text advertisements. Semantic features mined from Wikipedia are added to documents to improve the vector space model. Then, based on the expanded Wikipedia semantic features, a new document similarity measure is proposed; a clustering process labels high-confidence unlabeled samples, which enlarges the labeled set and improves advertisement text classification. For identifying illegal advertisements, two cases are addressed. For advertisements containing prohibited words, traditional keyword matching is improved and a context-based logical keyword matching technique is proposed. For advertisements containing illegal descriptive sentences, given that advertisement texts are short and semantically sparse, an illegal advertisement identification model based on probabilistic latent semantic analysis (PLSA) is proposed. Experiments show that the proposed algorithms improve illegal advertisement identification. An intelligent identification system for illegal advertisements in industry and commerce is designed and implemented. The thesis describes the system's goals and overall design, the training process of the identification model, the acquisition of system data, and the task management and violation report management platform that the system provides to users.
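The abstract names context-based logical keyword matching but does not give the rule format, so the following is only a minimal sketch of the general idea under stated assumptions. The `Rule` class, its fields (`require_any`, `exclude_any`, `window`), the whitespace tokenization, and the sample rule are all hypothetical illustrations, not the thesis's actual design. A prohibited keyword counts as a hit only when the words around it satisfy simple logical conditions.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical rule: a prohibited keyword plus logical context conditions."""
    keyword: str
    require_any: list = field(default_factory=list)  # OR: at least one must appear nearby
    exclude_any: list = field(default_factory=list)  # NOT: none may appear nearby
    window: int = 5                                  # context size in tokens on each side


def match_rule(tokens, rule):
    """Return the token positions where the keyword occurs in a qualifying context."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != rule.keyword:
            continue
        start = max(0, i - rule.window)
        # Context is the window before and after the keyword, excluding the keyword itself.
        context = tokens[start:i] + tokens[i + 1:i + 1 + rule.window]
        if rule.require_any and not any(w in context for w in rule.require_any):
            continue
        if any(w in context for w in rule.exclude_any):
            continue
        hits.append(i)
    return hits


def is_illegal(text, rules):
    """Flag the text if any rule matches; tokenization here is naive whitespace splitting."""
    tokens = text.lower().split()
    return any(match_rule(tokens, rule) for rule in rules)


# Assumed example rule: "cure" is suspicious only next to a guarantee and without a negation.
rules = [Rule("cure", require_any=["guaranteed", "100%"], exclude_any=["cannot", "no"])]

flagged = is_illegal("this medicine is a guaranteed cure for diabetes", rules)
not_flagged = is_illegal("this product cannot offer a guaranteed cure", rules)
```

Compared with plain keyword matching, which would flag both sample sentences because each contains "cure", the context conditions let the second sentence pass. Chinese advertisement text would need a word segmenter in place of `split()`.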
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1
