Research on Web Mining Technology and Its Applications on the Internet
Published: 2018-10-26 19:57
[Abstract]: As information technology advances, computing and communication technologies are driving the informatization of modern society and changing everyday life. They have also brought explosive growth in data, creating an urgent need for ways to use and process massive data effectively. Data mining emerged against this big-data background. Web mining, a branch of the field, targets the effective organization and use of the massive data on the World Wide Web. Because Internet technology changes rapidly while Web mining developed relatively late, this thesis takes Web mining as its research focus and analyzes its applications on the Internet in depth.
The thesis first introduces the research background, current state, technical difficulties, and future directions of Web technology, and explains related concepts such as data mining and machine learning. It then turns to the implementation process and application scenarios of Web mining, describing the core steps of text preprocessing and the technical background of two applications: topic detection and tracking, and user behavior analysis.
Topic detection and dynamic tracking, an important application of Web content mining, aims to detect previously unknown topics and track the subsequent development of known ones.
For news event reports on online media, the thesis implements a hybrid clustering solution. The solution adjusts the topic model hierarchically based on a "contribution degree", which makes it better suited to building Internet news topics and greatly improves efficiency. Experiments on real Internet news data show that, compared with the K-Means algorithm, the solution significantly improves precision and recall, and the resulting topic tree shows a clear hierarchy.
For Chinese microblog (Weibo) text, the thesis proposes an incremental fuzzy clustering solution based on topic words. It first proposes a spam-filtering scheme tailored to the textual characteristics of microblogs. It then weights topic words using two factors, timeliness and term frequency, to suit those characteristics. Finally, it uses incremental fuzzy clustering to detect bursty topics. Experiments on real microblog data show that the solution effectively detects emergencies and hot topics, with fairly good time efficiency.
User behavior analysis, an important application of Web usage mining, aims to understand user habits and interests and to assess users' satisfaction with a product, so that the product and the user experience can be improved.
For evaluating user satisfaction with search engines, the thesis describes an automated solution based on user behavior. It first describes the preprocessing of raw web logs: extracting concrete user actions from the log data and deriving features from them. It then proposes a recommendation technique based on the CURE algorithm, and the selected samples are labeled manually. Finally, it uses dynamic modeling to build the user satisfaction model. Results on real search engine data show that the machine-learning-based automated evaluation approaches the level of manual evaluation and meets practical requirements, and that the dynamic model, through multi-model construction, automatic updating, and feedback correction, effectively extends its life cycle and improves the continuity of learning.
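The abstract does not give the thesis's weighting formula or clustering procedure, so the following Python sketch is only one plausible reading of "weight terms by timeliness and word frequency, then cluster incrementally with fuzzy membership". The exponential half-life decay, the cosine similarity, the threshold value, and every name here (`term_weights`, `IncrementalFuzzyClusterer`, `half_life_hours`) are assumptions made for illustration, not the author's method.

```python
import math
from collections import Counter


def term_weights(tokens, age_hours, half_life_hours=24.0):
    """Relative term frequency scaled by a recency factor (assumed form)."""
    if not tokens:
        return {}
    decay = 0.5 ** (age_hours / half_life_hours)  # newer posts weigh more
    counts = Counter(tokens)
    total = len(tokens)
    return {term: decay * count / total for term, count in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalFuzzyClusterer:
    """Single-pass clustering in which a post may belong to several topics."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # hypothetical similarity cut-off
        self.centroids = []         # one accumulated term-weight dict per topic

    def add(self, vec):
        """Assign one post; return {topic_index: membership degree}."""
        sims = {i: cosine(vec, c) for i, c in enumerate(self.centroids)}
        matches = {i: s for i, s in sims.items() if s >= self.threshold}
        if not matches:
            # Nothing is similar enough: the post opens a new topic.
            self.centroids.append(dict(vec))
            return {len(self.centroids) - 1: 1.0}
        total = sum(matches.values())
        memberships = {i: s / total for i, s in matches.items()}
        # Fold the post into each matching topic in proportion to membership.
        for i, degree in memberships.items():
            centroid = self.centroids[i]
            for term, w in vec.items():
                centroid[term] = centroid.get(term, 0.0) + degree * w
        return memberships
```

Because cosine similarity ignores vector length, the recency factor in this sketch does not change which topic a post matches. It only reduces how strongly an older post shifts the topic centroid. Posts are expected to arrive already tokenized and spam-filtered, since the abstract names those steps without detailing them.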
[Degree-granting institution]: 山东大学 (Shandong University)
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP311.13; TP391.1
Article ID: 2296793
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2296793.html