Research on Web-Oriented Textual Geographic Information Mining Technology

Published: 2018-08-16 18:28

[Abstract]: Geographic information has important civil, commercial, and national defense applications, yet acquiring it is constrained in many ways. The Internet now holds a large amount of geographic information, and retrieving it from the web, which overcomes the limits of traditional acquisition methods, has become an important means of obtaining geographic information. However, web data is massive and heterogeneous, which makes extracting geographic information from it very difficult. To address this problem, this thesis studies the acquisition and the classification of geographic information. It first proposes a topic-focused web crawler algorithm that incorporates a geographic information ontology library: the ontology library is built and used to assess the relevance of page content, and this assessment is combined with link filtering and link authority evaluation to screen pages for geographic information. Experimental results show that the proposed algorithm effectively filters out pages unrelated to geographic information and improves the accuracy of geographic page acquisition. For classification, the thesis proposes a nearest neighbor classification algorithm that incorporates a distance threshold: a sample is assigned to a class according to the spatial distance between each class centroid and the sample to be classified, compared against a preset distance threshold. Experimental results show that this algorithm classifies geographic information effectively and with high accuracy. The Apriori algorithm is also used to mine association rules in geographic information. Finally, the proposed topic crawler and nearest neighbor classifier are used to implement a web-oriented textual geographic information mining system. The system compares web page text against the ontologies in the geographic information ontology library to assess page relevance. It screens and retrieves page text that is highly relevant to geographic information, preprocesses it, extracts text features, uses the feature set to convert each page text into a vector, and classifies it. Information about the required place names and locations is extracted by matching basic geographic information keywords and extracting text summaries, and association rules over geographic information are extracted with Apriori. System tests show that the resulting Web geographic information mining system implements web text acquisition, web text classification, text information extraction, and geographic association rule mining.
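The centroid-and-threshold classification idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the distance metric, the threshold rule, or what happens to a sample that lies beyond the threshold, so the choices below are assumptions: Euclidean distance, a single global threshold, and rejection (returning `None`) when no centroid is close enough. The function names `class_centroids` and `classify` are hypothetical and are not taken from the thesis.

```python
import math
from collections import defaultdict


def class_centroids(samples):
    """Compute the mean vector (centroid) of each class.

    samples: list of (vector, label) pairs, all vectors the same length.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for i, x in enumerate(vec):
            sums[label][i] += x
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(vec, centroids, threshold):
    """Return the label of the nearest centroid, or None if it is too far.

    Assumption: Euclidean distance and a single global threshold.
    """
    best_label, best_dist = None, math.inf
    for label, centroid in centroids.items():
        dist = math.dist(vec, centroid)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else None


# Toy feature vectors standing in for web page text features.
train = [
    ([1.0, 0.0], "river"),
    ([0.8, 0.2], "river"),
    ([0.0, 1.0], "road"),
    ([0.2, 0.8], "road"),
]
centroids = class_centroids(train)
near = classify([0.9, 0.1], centroids, threshold=0.5)
far = classify([5.0, 5.0], centroids, threshold=0.5)
```

In this toy run, `near` falls within the threshold of the "river" centroid, while `far` is rejected because it is farther than the threshold from every centroid. Rejecting distant samples is one plausible reading of the threshold; the thesis may use it differently.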
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.1

[References]