Internet Service Reassembly and Content Extraction

Published: 2019-01-15 07:25

[Abstract]: The rapid development of the Internet has driven fast growth in network applications. The Internet offers users a wide variety of network services and continually meets their diverse needs. Massive amounts of data are generated every day, so filtering out harmful information and selecting useful information has significant research value and engineering significance. This thesis studies and implements service reassembly and content extraction for network applications. The work has three parts: the design and implementation of network service reassembly; regular-expression-based content extraction and security auditing for forum community applications; and DOM-tree-based web page content extraction and analysis. The thesis first introduces HTML, the DOM model, and the key technologies involved, such as packet capture and packet reassembly. Second, it designs and implements the network service reassembly process: it describes packet reassembly, uses the open-source libnids library to reassemble TCP sessions, and applies decompression and chunked decoding to the HTTP data to recover the web pages. Third, it collects communication data from dozens of popular forums, identifies through analysis several commonly used general-purpose forum systems, extracts forum identification features, proposes the concept of a forum fingerprint, and optimizes the traditional forum auditing method. Finally, combining the characteristics of web pages with the features of the information to be extracted, it proposes a DOM-based web page extraction method: web pages are preprocessed, tags are selected as extraction features, and a DOM tree is built to extract page content quickly. This method is used to implement the software version tracking module of a network office management service system, and the thesis analyzes web page feature extraction methods and web page characteristics.
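The HTTP decoding and tag-based extraction steps mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it assumes the TCP session has already been reassembled (the thesis uses libnids for that), that the body is chunked and gzip-compressed, and it uses the standard library's event-driven `html.parser` instead of building a full DOM tree. The names `decode_chunked`, `TagTextExtractor`, and `extract_by_tag` are hypothetical.

```python
import gzip
from html.parser import HTMLParser


def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked transfer-encoded body."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # Chunk size is hexadecimal; drop any chunk extension after ';'.
        size = int(body[pos:line_end].split(b";", 1)[0].strip(), 16)
        if size == 0:
            break  # last chunk; trailers, if any, are ignored here
        start = line_end + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that ends the chunk data
    return bytes(out)


class TagTextExtractor(HTMLParser):
    """Collect the text inside every element with the chosen tag name."""

    def __init__(self, tag: str):
        super().__init__()
        self.tag = tag.lower()
        self.depth = 0    # nesting depth inside the chosen tag
        self.parts = []   # text fragments of the element being read
        self.texts = []   # finished text of each matched element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self.parts).strip()
                if text:
                    self.texts.append(text)
                self.parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_by_tag(raw_body: bytes, tag: str) -> list:
    """Assumed pipeline: de-chunk, gunzip, then extract text by tag."""
    html = gzip.decompress(decode_chunked(raw_body))
    parser = TagTextExtractor(tag)
    parser.feed(html.decode("utf-8", errors="replace"))
    parser.close()
    return parser.texts
```

Calling `extract_by_tag(raw_body, "h1")` on a reassembled response body returns the text of each `h1` element. A real audit pipeline would first check the `Transfer-Encoding` and `Content-Encoding` headers rather than assume both encodings are present.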
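[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092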
