A User Web Information Collection and Analysis System Based on an Intelligent Gateway

Published: 2018-05-08 12:50

Topics: Web information collection + keyword extraction; Source: Shandong University, 2016 master's thesis

[Abstract]: With the arrival of the information age, the Internet has become the most important source of information for individuals and households. More and more users access the Internet through various smart terminal devices, and this way of obtaining and exchanging information has gradually become the mainstream. The convenient service software that followed has led major Internet companies to recognize that user information, as a strategic asset, has very high economic value. Capturing users' Web information against a background of massive data and analyzing their behavior habits is therefore of great significance, both for advancing academic research and for maintaining and developing enterprise customer resources. At present, the data used to analyze user behavior comes mainly from server-side user logs and browser cookies. In the former, the website records the user's behavior during a visit and generates server logs in a specific format; in the latter, a script embedded in the website sends user information to the back-end server. Both methods depend on specific websites. Ideally, a user's access data could be obtained regardless of which website the user visits. The router, as the center of home network connection and data distribution, occupies a crucial position in home networking. Exploiting this advantage, this thesis designs and implements a user Web information collection and analysis system based on an intelligent router, focusing on the limitations of existing collection methods and the one-sidedness of the information they collect. The system consists of a gateway and a back end. The gateway side extracts and transmits the user ID and the browsed URLs. After receiving the data collected by the gateway, the back-end server extracts the main text and keywords of the corresponding Web pages, compiles page browsing time statistics, crawls sub-links and computes their relevance, and classifies the text by topic. The main contributions of this thesis are the following five. (1) Having analyzed the system's specific environmental requirements and application scenarios, and drawing on the page-structure characteristics of news and online-shopping websites, it proposes a Web main-text extraction algorithm that combines text density with multiple feature values, which improves extraction speed while maintaining extraction accuracy. (2) It proposes a TF-IDF keyword extraction algorithm that combines statistical, structural, and linguistic analysis. The algorithm takes into account the effect of features such as word length and word span on keyword extraction, overcoming the drawback of traditional TF-IDF extraction, which relies entirely on term-frequency statistics. (3) It designs a topic-focused crawling strategy for a web crawler which, based on the proposed keyword extraction algorithm and the VSM (vector space model) principle of text similarity measurement, crawls sub-links across two levels of web pages and computes their relevance. (4) It proposes a chi-square-weighted Bayesian classification algorithm, which places greater emphasis on the correlation between categories and features during text classification and improves classification accuracy. (5) It presents an overall design for the user Web information collection and analysis system, implements the whole system in code, and finally tests the feasibility of the design in a home LAN built on an OpenWrt-based intelligent router.
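As a rough illustration of two building blocks named in the abstract, the sketch below computes plain TF-IDF weights for tokenized documents and compares two documents by cosine similarity in the vector space model. It is not the thesis's algorithm: the abstract does not say how word length, word span, or the structural and linguistic features enter the weighting, so none of that is reproduced here. The function names (`tf_idf`, `top_keywords`, `cosine_similarity`) and the toy documents are assumptions made for this example only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (plain TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def top_keywords(vector, k=3):
    """Return the k terms with the highest weight (ties broken alphabetically)."""
    return sorted(vector, key=lambda t: (-vector[t], t))[:k]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, already-tokenized pages (a real system would segment page text first).
docs = [
    ["router", "gateway", "router", "traffic"],
    ["router", "firmware", "gateway", "update"],
    ["recipe", "soup", "noodle", "soup"],
]
vectors = tf_idf(docs)

print(top_keywords(vectors[2], k=1))
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

In a crawler of the kind the abstract describes, a similarity score like this could be used to rank sub-links by how close their text is to the page the user actually visited. The threshold and ranking policy would be design choices that the abstract does not specify.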
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP274

[Similar literature]