Design and Implementation of a Business Insight System Based on Web Content

Published: 2018-04-29 17:23

Topics: URL analysis + web page classification; Source: Beijing University of Posts and Telecommunications, 2017 master's thesis

[Abstract]: The Internet era is an era of information explosion: people can browse a wide variety of online resources and develop their own distinctive browsing habits. For an individual user, the set of network resources accessed reflects, to some extent, that user's browsing habits and interests. At present, the common way to process these access logs is to use DPI (deep packet inspection) technology to compute routine field statistics, without analyzing the actual content of the packets. Where content is analyzed, the analysis is limited to the target text of the page the URL points to and ignores factors such as the structural characteristics of URL resources, which ultimately lowers the accuracy of content analysis. Methods that also treat background knowledge about URL resources as raw material for analysis, and that combine the multilevel structure of URLs with the characteristics of web page types to extract and analyze information from Web content (Web pages and URLs), have therefore become a research focus. Motivated by the needs of network operators seeking business insight into their users, this thesis studies the technical solutions required to realize business insight based on Web content, and designs and develops a business insight system based on Web content. The main research contents are: 1. Content extraction for different types of web pages: news, video, and e-commerce. The thesis analyzes the structural characteristics of each page type and designs and implements a content extraction method for each, which is then used in functional modules such as URL analysis and Web content analysis. 2. Acquisition of URL tag information. The thesis analyzes the structural characteristics and background knowledge of URLs and derives a method that identifies URL information and manages it automatically in a unified way. 3. The platform architecture of the system. Starting from the requirements, the thesis integrates the separate techniques as functional modules and turns them into a complete system. Based on the solutions obtained from this research and investigation, the thesis implements a method for obtaining multilevel tags for web page information, a method that splits a URL into multiple fields and classifies and parses the content of each field, and a processing method that matches and identifies information by searching network resources; testing verifies the effectiveness of these methods. Building on these key techniques, the thesis completes the development of the business insight system based on Web content. Taking the set of request URL fields in users' network access logs as input, the system implements URL analysis, web page classification, Web content analysis, and rule management. It converts the set of URL fields into behavioral feature information about users, providing a basis for user feature extraction and a prerequisite for network operators and other service providers to gain business insight into users.
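The abstract describes splitting a URL into several fields and classifying each field against managed rules. A minimal Python sketch of that idea follows. It is not the thesis's implementation: the rule table `HOST_RULES`, the `example.com` hosts, and the function names `split_url`, `tag_url`, and `profile` are all hypothetical, chosen only to illustrate the general approach.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Hypothetical rule table mapping a host name to a coarse page category.
# The entries are illustrative assumptions, not rules from the thesis.
HOST_RULES = {
    "news.example.com": "news",
    "video.example.com": "video",
    "shop.example.com": "e-commerce",
}


def split_url(url: str) -> dict:
    """Split a URL into fields that can each be classified separately."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname or "",
        # Path levels, e.g. /sports/2017/a.html -> ["sports", "2017", "a.html"]
        "path_segments": [seg for seg in parts.path.split("/") if seg],
        "query": dict(parse_qsl(parts.query)),
    }


def tag_url(url: str) -> dict:
    """Attach a category tag to the URL fields using the rule table."""
    fields = split_url(url)
    category = HOST_RULES.get(fields["host"], "unknown")
    return {"category": category, **fields}


def profile(urls) -> Counter:
    """Count categories over the request URLs found in one user's log."""
    return Counter(tag_url(u)["category"] for u in urls)
```

For example, `profile` could be applied to the request URL column of one user's access log to get a rough category histogram. A real system would need richer rules than a host lookup, such as the per-field classification and page-type analysis the abstract mentions.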
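[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP393.09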

[References]