
Research and Implementation of a Web Page Classification System Based on User Behavior Analysis

Published: 2018-01-15 14:30

  Source: Beijing University of Posts and Telecommunications, master's thesis, 2011. Document type: degree thesis


  Keywords: user behavior analysis; automatic web page classification; Chinese word segmentation; CHI statistic; SVM


[Abstract]: In recent years, with the rapid growth of the Internet, text information carried by web pages has emerged in large quantities, and the volume of online information has grown explosively. Finding the information one needs is like looking for a needle in a haystack, and search engines that work in a passive mode can no longer meet users' needs. How to satisfy users' personalized service requirements in an active mode has become one of the challenging problems facing new network service systems. Starting from user behavior analysis and personalized service, this thesis studies and improves the key techniques of web page classification and implements a text classification system adapted to web pages. The key techniques studied are as follows.

First, Chinese word segmentation. After studying existing segmentation methods, the thesis proposes a segmentation algorithm suited to web page text that combines statistics with maximum matching. The method recognizes new words that appear in web pages and merges frequently occurring combinations of single characters. It therefore avoids missing new words that contribute strongly to classification, and merging single characters reduces both the dimensionality of the feature space and the computational complexity.

Second, feature extraction and weighting. After examining feature selection and weighting algorithms, the thesis adapts the CHI statistic, which is generally regarded as effective, to web page classification. It proposes a CHI feature selection algorithm based on web page structure and a TF-IDF-CHI weighting algorithm. Experimental results show that these two preprocessing algorithms improve classification accuracy to some extent.

Based on these improved algorithms, the thesis implements a web page classification module, and it also designs and implements a complete user behavior analysis system.
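The abstract does not spell out the structure-based CHI variant, so the following is only a minimal sketch of the standard CHI (chi-square) statistic that such a method would start from. For a term t and category c it uses the usual document counts A (in c, containing t), B (outside c, containing t), C (in c, without t) and D (outside c, without t), and scores a term by its maximum CHI value over all categories. The function names `chi_square` and `select_features`, and the choice of the maximum over categories, are assumptions made for illustration and are not taken from the thesis.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Standard CHI statistic for one (term, category) pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or category carries no signal
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest CHI score (hypothetical helper).

    docs   : list of token lists, one per document (already segmented)
    labels : list of category labels, aligned with docs
    """
    n_docs = len(docs)
    docs_per_cat = Counter(labels)
    # term -> category -> number of documents containing the term
    term_cat = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_cat[term][label] += 1

    scores = {}
    for term, per_cat in term_cat.items():
        df = sum(per_cat.values())  # documents containing the term
        best = 0.0
        for cat, n_cat in docs_per_cat.items():
            a = per_cat[cat]
            b = df - a
            c = n_cat - a
            d = n_docs - n_cat - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

A structure-aware variant could, for example, count a term differently depending on whether it appears in the page title or the body, but how the thesis does this is not stated in the abstract.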
The user behavior analysis system consists of three main modules: a data collection and filtering module, a web page classification module, and a result statistics module. Their functions are as follows.

First, the data collection and filtering module. The user attribute information of web behavior resides in the headers of HTTP packets, so obtaining it requires parsing the packets and extracting the relevant fields. This module covers the HTTP packet parsing flow designed and implemented in the thesis.

Second, the web page classification module, which is the main research object of the thesis. Building on the improved segmentation algorithm, the preprocessing algorithms, and the KNN and SVM classification algorithms, which classify well, it maps web pages to specific categories.

Third, the result statistics module. It aggregates and updates the classification results of the pages a user visits and connects directly to the personalized service system, so that the results of user behavior analysis feed directly into services such as personalized advertising.

The system supports both online and offline web page classification. Experimental results show that the improved preprocessing algorithms have a good corrective effect on classification accuracy, and that the design of the result statistics module also gives good results: it reflects users' current interests and provides a reference model for research on personalized service systems.
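To make the header-parsing step of the data collection and filtering module concrete, here is a minimal sketch that parses the request line and headers of a raw HTTP/1.x request and pulls out a few fields. The abstract does not say which header fields the thesis extracts, so the choice of `Host`, `User-Agent` and `Referer`, as well as the function names `parse_http_request` and `extract_user_fields`, are assumptions for illustration only.

```python
def parse_http_request(raw: bytes) -> dict:
    """Parse the request line and headers of a raw HTTP/1.x request."""
    # Headers end at the first blank line; anything after it is the body.
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    parts = lines[0].split(" ", 2)
    if len(parts) != 3:
        raise ValueError("malformed request line")
    method, target, version = parts

    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value"
        # Header names are case-insensitive, so normalise to lower case.
        headers[name.strip().lower()] = value.strip()
    return {"method": method, "target": target,
            "version": version, "headers": headers}


def extract_user_fields(request: dict) -> dict:
    """Pick out fields that could feed the classifier (assumed selection)."""
    headers = request["headers"]
    return {
        "url": "http://" + headers.get("host", "") + request["target"],
        "user_agent": headers.get("user-agent"),
        "referer": headers.get("referer"),
    }
```

In a pipeline like the one described, the reconstructed URL would identify the page to fetch and classify, and the remaining fields would be stored with the user's record. Capturing packets from the network and reassembling TCP streams is outside the scope of this sketch.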
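[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2011
[Classification number]: TP393.092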



