Research and Implementation of Web Information Extraction in a Bookmark System

Published: 2018-12-15 19:39

[Abstract]: A social bookmarking system is an effective tool for collecting, managing, and sharing Web information resources, but its social features depend on the number of users and the volume of resources. This thesis studies how to apply Web information extraction and related natural language processing research to a bookmark system in order to solve the system's cold-start problem and improve the user experience. It first studies and implements a Web information extraction algorithm. Building on the Goose project, the algorithm improves Web page fetching, adds automatic detection of page encoding, optimizes page preprocessing based on observing and summarizing the HTML structure of a large number of websites, adds support for extracting information from Chinese Web pages, and finally formats the extracted main text for better readability. The result is a Web information extraction module based on ElementTree that is practical enough for use in production systems. Based on the extraction results and Web page metadata, the thesis also implements a resource-based tag recommendation algorithm and a simple Web page summarization feature.
The thesis then designs and implements a bookmark system. The infrastructure uses Tornado as both the Web server and the Web development framework and MongoDB as the database server. The client uses the AngularJS and jQuery frameworks with Bootstrap 3 styling to provide a responsive, flat, grid-based client application, and a Chrome browser extension is also implemented. The system integrates the Web information extraction module to let users read and edit bookmark content, which effectively improves the user experience. Because the extracted content is available, the system's search is implemented as full-text search. This avoids the limitation of traditional bookmark systems, which usually search only tags or titles, and it also avoids the noise introduced by full-text search over the entire Web page. Unlike currently popular recommended-reading systems, the system focuses on bookmark management rather than reading. Combining the bookmark system with a note-taking system could effectively provide a second round of information filtering.
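The abstract mentions automatic detection of page encoding but does not say how it is done. The following is a minimal sketch of one common approach, not the thesis's actual method: try the charset declared in the HTTP `Content-Type` header, then the one declared in a `<meta>` tag, then a few fallbacks, and accept the first that decodes the bytes without error. The function name `detect_encoding`, the fallback order, and the 4096-byte scan window are assumptions made for illustration.

```python
import re

# Hypothetical helper; names and fallback order are illustrative assumptions,
# not taken from the thesis or from the Goose project.
_HEADER_CHARSET = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def detect_encoding(raw: bytes, content_type: str = "") -> str:
    """Guess the encoding of a fetched page from its declarations and bytes."""
    candidates = []

    # 1. Charset declared in the HTTP Content-Type header, if any.
    match = _HEADER_CHARSET.search(content_type)
    if match:
        candidates.append(match.group(1))

    # 2. Charset declared in a <meta> tag near the top of the document.
    match = _META_CHARSET.search(raw[:4096])
    if match:
        candidates.append(match.group(1).decode("ascii"))

    # 3. Fallbacks: UTF-8 first, then GB18030 (a superset of GBK and GB2312).
    candidates += ["utf-8", "gb18030"]

    # Accept the first candidate that decodes the whole payload strictly.
    for name in candidates:
        try:
            raw.decode(name)
            return name.lower()
        except (LookupError, UnicodeDecodeError):
            continue

    # latin-1 maps every byte to a character, so it never fails.
    return "latin-1"
```

A caller would pass the response body and header value, for example `html = body.decode(detect_encoding(body, headers.get("Content-Type", "")))`. Strict decoding catches the case where a server declares UTF-8 but actually serves GBK, which the sketch resolves by falling through to GB18030.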
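[Degree-granting institution]: Nanjing University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092; TP391.3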

[References]