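Application of the Continuous Hidden Markov Model to Click Fraud Detection
Published: 2018-05-31 23:01
Topic: click fraud + continuous hidden Markov model; Source: Shanghai Jiao Tong University, 2013 master's thesis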
[Abstract]: With the rapid growth of keyword advertising on search engines, fraudulent clicks have become a major problem for advertisers and search engine companies, and the detection and prevention of click fraud has become a focus of researchers in China and abroad. This thesis analyzes the click fraud problem in online search engine keyword advertising and its behavioral characteristics. Because the click behavior associated with keyword advertisements fits the basic assumptions of the hidden Markov model (HMM) reasonably well, the thesis attempts to apply the HMM framework to click fraud detection. The main work is as follows:
(1) The HMM is only a theoretical framework. The thesis analyzes keyword click behavior patterns, builds a continuous hidden Markov model (CHMM) for search engine keyword advertising, and establishes a method for identifying fraudulent clicks.
(2) The CHMM is trained on observed data (parameter estimation) and its detection performance is validated. The statistical results show that the CHMM identifies click fraud with high accuracy.
(3) The thesis examines how detection accuracy varies with the model parameters, namely the number of hidden states N, the sequence length R, and the threshold, in order to determine the best number of hidden states (a fixed value), threshold, and other parameters.
(4) Factors such as time of day and unexpected events can noticeably raise the click-through rate of an online keyword advertisement without any fraudulent clicks being involved. The thesis uses a dynamic CHMM that continually updates the time-series data used for training to produce new parameters, which substantially reduces the effect of such factors on detection accuracy.
(5) Parameter estimation is the key to whether an HMM achieves high accuracy in recognition problems. The traditional Baum-Welch algorithm has a number of shortcomings. Compared with Baum-Welch, the training algorithm based on Segmental K-Means (SKM) has lower computational complexity and converges faster, and it places more emphasis on automatically classifying the model's output patterns. SKM is therefore better targeted and more applicable to click fraud detection, and the empirical analysis also shows that SKM training gives better detection results. In addition, the thesis makes a preliminary exploration of HMM parameter estimation using MCMC-based Gibbs sampling.
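The abstract does not spell out the detection rule, so the following is only a minimal sketch of how threshold-based scoring with a CHMM could work. It assumes univariate Gaussian emissions over a click-count series, scores a window with the forward algorithm in log space, and flags the window when its average log-likelihood per observation falls below a threshold. The function names, the two-state parameters, the sample windows, and the threshold value are all hypothetical and are not taken from the thesis.

```python
import numpy as np

def log_gaussian(x, means, variances):
    """Log density of N(means, variances) at x, one value per hidden state."""
    return -0.5 * (np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances)

def chmm_log_likelihood(obs, start_prob, trans, means, variances):
    """Forward algorithm in log space for a Gaussian-emission HMM."""
    log_trans = np.log(trans)
    # alpha[i] = log P(o_1..o_t, state_t = i)
    alpha = np.log(start_prob) + log_gaussian(obs[0], means, variances)
    for x in obs[1:]:
        # scores[i, j] = alpha[i] + log P(state j | state i)
        scores = alpha[:, None] + log_trans
        m = scores.max(axis=0)
        # log-sum-exp over the previous state, then add the emission term
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
        alpha = alpha + log_gaussian(x, means, variances)
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def is_suspicious(obs, model, threshold):
    """Flag a window whose mean log-likelihood per observation is below threshold."""
    obs = np.asarray(obs, dtype=float)
    score = chmm_log_likelihood(obs, *model) / len(obs)
    return bool(score < threshold), float(score)

# Hypothetical 2-state model of clicks per interval ("quiet" and "busy" traffic).
model = (
    np.array([0.6, 0.4]),                 # initial state probabilities
    np.array([[0.9, 0.1], [0.2, 0.8]]),   # state transition matrix
    np.array([5.0, 20.0]),                # emission means
    np.array([4.0, 25.0]),                # emission variances
)
THRESHOLD = -6.0  # assumed value; the thesis tunes this empirically

normal_window = [4, 6, 5, 7, 18, 22, 19, 6]
burst_window = [5, 6, 90, 95, 110, 100, 6, 5]

normal_flag, normal_score = is_suspicious(normal_window, model, THRESHOLD)
burst_flag, burst_score = is_suspicious(burst_window, model, THRESHOLD)
print(normal_flag, round(normal_score, 2))
print(burst_flag, round(burst_score, 2))
```

Dividing by the window length keeps scores comparable across different sequence lengths R. In a dynamic setup like the one described in item (4), the `model` tuple would be re-estimated periodically from recent data before scoring new windows.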
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3