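Research on Spam Blog Detection and Related Techniques
Published: 2018-01-20 08:14
Keywords: feature association tree; combined features; spam blog classification; statistical features; feature selection. Source: Liaoning Normal University, master's thesis, 2012. Document type: degree thesis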
[Abstract]: In recent years, with the development of Internet technology, blogs (weblogs) have become a very popular social communication mechanism for new media by giving authors and readers an interactive platform and a dynamically updated social network. According to surveys, fields such as scientific research, statistical surveys, public construction, education, and social welfare all make use of blog analysis results, so the huge information sources and information volume of blogs are extremely valuable. Spam blogs (splogs), however, have become rampant alongside them. They are produced mainly by stealing other people's content or by automatic machine generation, and their purpose is to raise the ranking of target websites in search engines in order to link to profitable advertisements. The problems caused by spam blogs include: 1) a serious drop in the retrieval quality of blogs; 2) a clear waste of network and storage resources. To protect the blogosphere, spam blogs must therefore be filtered.
First, based on an analysis of various blog features, this thesis extracts two efficient features, combines them with traditional content features, and classifies blogs by feature combination. Drawing on the statistical regularities of spam blogs and the analysis of spam blog author attributes by Yuuki Sato and Takehito Utsuro, the thesis brings out the importance of author attributes in blog classification, which shows that author attributes have significant research value. Ordinary bloggers tend to post irregularly, whereas spam blogs must publish a large number of posts within a short time in order to raise page click rates and thereby the site's ranking on ALEXA; in addition, machines generate spam posts very quickly. Normal blogs and spam blogs therefore differ considerably in their temporal self-similarity features. Using the differences in author attributes and self-similarity features, the thesis performs a first filtering pass over blog posts and then combines the extracted content features to increase the complementarity among features, which greatly improves the efficiency of spam blog filtering.
Second, the thesis designs a feature association tree classification algorithm for spam blog feature selection. The algorithm builds a feature association tree from the correlations between features in order to select features: it prunes irrelevant and redundant features, retains strongly and weakly relevant features, and uses expected cross-entropy to perform a second round of selection over the feature association tree [2]. Compared with traditional feature selection algorithms, it can eliminate the effect of class imbalance in blog sample data, and it adaptively adjusts the size of the feature association tree according to feature similarity and the magnitude of the expected cross-entropy, thereby reducing feature dimensionality. Comparative experiments on spam blog filtering show that the algorithm achieves good accuracy and recall.
Both spam blog detection algorithms proposed in this thesis are binary classification algorithms for dynamic text. Building on an analysis of traditional spam blog features, they mine efficient features for detecting spam blogs together with the correlations among features, which effectively reduces feature dimensionality and increases detection speed. Comparative experiments on classical classifiers show that the proposed detection algorithms classify well.
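The abstract names expected cross-entropy as the criterion for the second round of feature selection but does not state the formula. As a rough illustration only, the sketch below uses the definition commonly given in the text classification literature, ECE(t) = P(t) · Σ_c P(c|t) · log(P(c|t) / P(c)), computed over binary term presence. The function names, the add-one smoothing, and the top-k selection step are assumptions made for this example and are not taken from the thesis.

```python
import math
from collections import Counter


def expected_cross_entropy(docs, labels, smoothing=1.0):
    """Score each term t by P(t) * sum_c P(c|t) * log(P(c|t) / P(c)).

    docs   : list of token lists, one per blog post
    labels : list of class labels (for example "spam" / "ham")
    The smoothing constant is an assumption of this sketch; it keeps
    P(c|t) away from zero so the logarithm is always defined.
    """
    n_docs = len(docs)
    classes = sorted(set(labels))
    class_count = Counter(labels)

    doc_freq = Counter()  # term -> number of posts containing it
    joint = Counter()     # (term, class) -> number of posts
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # presence, not raw frequency
            doc_freq[term] += 1
            joint[(term, label)] += 1

    scores = {}
    for term, df in doc_freq.items():
        p_t = df / n_docs
        total = 0.0
        for c in classes:
            p_c = class_count[c] / n_docs
            p_c_given_t = (joint[(term, c)] + smoothing) / (
                df + smoothing * len(classes)
            )
            total += p_c_given_t * math.log(p_c_given_t / p_c)
        scores[term] = p_t * total
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (hypothetical selection step)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A term that occurs equally often in both classes scores close to zero, while a term concentrated in one class scores higher, which is why a criterion of this kind can serve as a second filter after correlation-based pruning.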
[Degree-granting institution]: Liaoning Normal University
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP393.092
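[References]
Related journal articles (top 10)
1 何海江; 凌云. 由向量空间相关模型识别博客文章的垃圾评论 (Identifying spam comments on blog posts with a vector space correlation model) [J]. 长沙大学学报, 2008(02).
2 严超; 王元庆; 李久雪; 张兆扬. AdaBoost分类问题的理论推导 (Theoretical derivation of the AdaBoost classification problem) [J]. 东南大学学报(自然科学版), 2011(04).
3 王圆; 孙铁利; 李杨. Web文本挖掘中的特征表示和特征提取 (Feature representation and feature extraction in Web text mining) [J]. 电脑知识与技术, 2006(14).
4 苏丹; 周明全; 王学松; 任玉芝. 一种基于最少出现文档频的文本特征提取方法 (A text feature extraction method based on least-occurrence document frequency) [J]. 计算机工程与应用, 2012(10).
5 严超; 王元庆. 连续型Adaboost算法研究 (Research on the continuous Adaboost algorithm) [J]. 计算机科学, 2010(09).
6 兰均; 施化吉; 李星毅; 徐敏. 基于特征词复合权重的关联网页分类 (Associative web page classification based on composite weights of feature words) [J]. 计算机科学, 2011(03).
7 钟将; 孙启干; 李静. 基于归一化向量的文本分类算法 (A text classification algorithm based on normalized vectors) [J]. 计算机工程, 2011(08).
8 王博; 贾焰; 杨树强; 韩伟红. 文本多分类中的特征选择研究 (Research on feature selection in multi-class text classification) [J]. 计算机工程与科学, 2010(08).
9 崔自峰; 徐宝文; 张卫丰; 徐峻岭. 一种近似Markov Blanket最优特征选择算法 (An approximate Markov blanket optimal feature selection algorithm) [J]. 计算机学报, 2007(12).
10 秦进; 陈笑蓉; 汪维家; 陆汝占. 文本分类中的特征抽取 (Feature extraction in text classification) [J]. 计算机应用, 2003(02).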
Article ID: 1447506
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1447506.html