Research on Web Page Text Extraction and Processing Based on Web Information Retrieval

Published: 2018-04-28 06:22

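Topic of this thesis: information retrieval + topical web crawler; source: master's thesis, Nanjing University of Posts and Telecommunications, 2014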

[Abstract]: With society developing rapidly, the geographic environment changes constantly, and traditional methods of geographic information surveying and mapping run into many problems. The Internet, today's most important information carrier, offers strong timeliness and low cost of information acquisition, which opens a new channel for geographic information surveying and mapping. Combining web information retrieval with natural language processing makes it possible to obtain geographic knowledge from massive Internet data and to perform fast retrieval and real-time detection of geographic information changes and updates, compensating for the shortcomings of traditional surveying methods. This thesis studies web information retrieval from the perspective of topical (focused) web crawlers. To address the limited generality of existing topical crawler algorithms, it proposes a topical crawler algorithm based on link backtracking. Exploiting the link structure of current news websites, the algorithm uses backtracking to compute the link direction most likely to contain topic-relevant content, which greatly improves the efficiency of acquiring topic-relevant pages. In addition, drawing on web text mining and natural language processing, the thesis designs extraction methods for web page text elements and geographic information elements that accurately extract the relevant information from page text. Finally, the thesis implements a prototype system for geographic information change detection based on topical crawler technology. Extensive experiments show that the system has good usability and that its query results achieve high recall and precision; they also confirm that the link-backtracking topical crawler crawls more efficiently than a general-purpose crawler.
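The abstract does not spell out how link backtracking is computed, so the following is only a toy sketch of one possible reading, not the thesis's algorithm. It assumes that when a topic-relevant page is found, the crawler walks back along the chain of pages that led to it and credits the URL directory of each page on that path; unvisited links in credited directories are then fetched first. The in-memory `SITE` map, the `TOPIC_TERMS` set, the `relevance` and `directory` helpers, the threshold, and the directory-crediting rule are all hypothetical and introduced here for illustration.

```python
from urllib.parse import urlparse

# Hypothetical in-memory "news site": URL -> (page text, outgoing links).
BASE = "http://news.example"
SITE = {
    BASE + "/": ("home page", [
        BASE + "/sports/1.html", BASE + "/geo/1.html",
        BASE + "/sports/2.html", BASE + "/sports/3.html",
        BASE + "/geo/2.html", BASE + "/geo/3.html"]),
    BASE + "/sports/1.html": ("football match report", []),
    BASE + "/sports/2.html": ("tennis final preview", []),
    BASE + "/sports/3.html": ("marathon results", []),
    BASE + "/geo/1.html": ("new road opened near the river bridge", []),
    BASE + "/geo/2.html": ("bridge rebuilt after road closure", []),
    BASE + "/geo/3.html": ("river survey updates the county map", []),
}
TOPIC_TERMS = {"road", "bridge", "river"}  # assumed topic vocabulary


def relevance(text):
    """Fraction of topic terms that occur in the page text (toy measure)."""
    return len(set(text.lower().split()) & TOPIC_TERMS) / len(TOPIC_TERMS)


def directory(url):
    """URL path up to and including the last slash, e.g. '/geo/'."""
    return urlparse(url).path.rsplit("/", 1)[0] + "/"


def crawl(seed, threshold=0.3):
    dir_score = {}         # directory -> relevance credited via backtracking
    parent = {seed: None}  # page -> page on which its link was discovered
    frontier = [seed]      # unvisited URLs in discovery order
    order, hits = [], []
    while frontier:
        # Pick the link whose directory has the highest credit so far;
        # ties go to the earliest-discovered link.
        best = max(range(len(frontier)),
                   key=lambda i: (dir_score.get(directory(frontier[i]), 0.0), -i))
        url = frontier.pop(best)
        text, links = SITE[url]
        order.append(url)
        score = relevance(text)
        if score >= threshold:
            hits.append(url)
            # Backtrack along the discovery path and credit each directory.
            node = url
            while node is not None:
                d = directory(node)
                dir_score[d] = dir_score.get(d, 0.0) + score
                node = parent[node]
        for link in links:
            if link not in parent and link in SITE:
                parent[link] = url
                frontier.append(link)
    return order, hits


if __name__ == "__main__":
    visited, relevant = crawl(BASE + "/")
    print("\n".join(visited))
```

In this toy site, the first relevant page found under `/geo/` raises the priority of the remaining `/geo/` links, so they are fetched before the unvisited `/sports/` pages. A plain breadth-first crawler would instead follow discovery order.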
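[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092;TP391.1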
