Research and Application of Chinese RDF Knowledge Base Construction

Published: 2018-03-08 11:29

Topic: knowledge base　Focus: Resource Description Framework　Source: Southwest Jiaotong University, 2016 master's thesis　Document type: degree thesis

[Abstract]: Big data on the Internet has brought a wealth of information into people's lives: a keyword search is enough to retrieve related news and links to material. However, acquiring knowledge by clicking through links becomes very inefficient as the volume of data keeps growing. Most information on the Internet is currently stored and published as web pages, with documents connected by hyperlinks. Humans can understand the information in these documents, but computers have difficulty doing so. To make better use of the big data generated by the Internet, research institutions abroad have built knowledge bases from English Wikipedia, such as FreeBase and DBPedia. Knowledge bases in China include Baidu Zhixin, Sogou Zhilifang, and Tsinghua XLore. Knowledge bases have important application value in research areas such as knowledge graphs, information fusion, and question answering. Foreign knowledge bases such as FreeBase provide public Resource Description Framework (RDF) data sources but contain relatively little Chinese entity data, so how to build a high-quality Chinese RDF knowledge base has become a current research focus. Against this background, this thesis studies methods for building a Chinese RDF knowledge base from online encyclopedias and carries out work in the following areas:
1. Large-scale data collection from online encyclopedias is studied in depth, and the specific problems and challenges encountered during collection are analyzed. A data collection system combining the Spring MVC framework and the Scrapy framework is built; its crawling performance is stable and it offers a good user interface. An algorithm for automatically extracting proxy IP information is proposed; it extracts proxy IP information effectively and overcomes the anti-crawling measures of websites.
2. Entity information extraction from online encyclopedia data is studied, and a method is proposed that uses RDFS semantic information to semantically annotate the extracted data and normalize the RDF data. Graph database storage of RDF data is studied, and an RDF graph storage system based on NEO4J is developed. A comparison with traditional relational database storage shows that the implemented storage system can meet the storage and query requirements of large-scale RDF data.
3. The entity alignment problem that arises when building a knowledge base from the heterogeneous data sources Baidu Baike and Hudong Baike is studied in depth. An entity alignment method that combines entity attribute information with contextual topic features is proposed. A comparison with traditional entity alignment methods shows that the proposed method outperforms existing ones.
4. Large-scale online encyclopedia data collection, RDF conversion of entity information, storage and SPARQL querying, and entity alignment across heterogeneous data sources are combined to design and implement a system that automatically builds a Chinese online-encyclopedia RDF knowledge base. Through configured collection tasks, the system downloads online encyclopedia data, extracts entity data, and converts it to RDF and stores it, thereby providing entity lookup and SPARQL query functions to external applications.
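The entity alignment idea in item 3 can be illustrated with a minimal sketch. The abstract only states that attribute information and contextual topic features are combined; everything concrete below is an assumption made for illustration, not the thesis's actual method: Jaccard overlap stands in for attribute matching, cosine similarity over context terms stands in for topic features, and the function names, the `attr_weight` parameter, the threshold, and the sample entries are all hypothetical.

```python
import math
from collections import Counter


def attribute_similarity(attrs_a, attrs_b):
    """Jaccard overlap of (attribute, value) pairs from two entity records."""
    pairs_a, pairs_b = set(attrs_a.items()), set(attrs_b.items())
    if not pairs_a and not pairs_b:
        return 0.0
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)


def context_similarity(terms_a, terms_b):
    """Cosine similarity of term-frequency vectors built from context words.

    Assumption: raw term counts are a stand-in for topic features.
    """
    vec_a, vec_b = Counter(terms_a), Counter(terms_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def alignment_score(entity_a, entity_b, attr_weight=0.5):
    """Weighted sum of attribute and context similarity (weight is hypothetical)."""
    attr = attribute_similarity(entity_a["attrs"], entity_b["attrs"])
    ctx = context_similarity(entity_a["context"], entity_b["context"])
    return attr_weight * attr + (1.0 - attr_weight) * ctx


def is_same_entity(entity_a, entity_b, threshold=0.5):
    """Treat two records as the same entity when the score reaches the threshold."""
    return alignment_score(entity_a, entity_b) >= threshold


# Hypothetical records for a same-named entry in two encyclopedias.
baidu_entry = {
    "attrs": {"name": "Li Bai", "born": "701", "occupation": "poet"},
    "context": ["tang", "poet", "poetry", "wine"],
}
hudong_entry = {
    "attrs": {"name": "Li Bai", "born": "701", "era": "Tang"},
    "context": ["tang", "poet", "poetry", "moon"],
}
other_entry = {
    "attrs": {"name": "Li Bai", "occupation": "singer"},
    "context": ["pop", "music", "album"],
}

print(is_same_entity(baidu_entry, hudong_entry))  # same person in both sources
print(is_same_entity(baidu_entry, other_entry))   # same name, different person
```

The point of combining the two signals is that entries sharing a name can still be told apart: the first pair agrees on both attributes and context, while the second pair shares only the name attribute and has no context words in common.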
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP391.1

[Similar literature]