Recognition and Extraction of Specific Entity Relations, and the Design and Implementation of a Supporting System

Published: 2019-01-27 15:03

[Abstract]: As Internet technology has advanced, the Internet has become an indispensable part of people's work and life. Its greatest strength is the vast amount of information it makes available, but that volume also makes information hard to find. Search engines give users a simple, fast route: a user submits keywords, and the engine retrieves related content from this mass of information and returns the addresses of the matching pages. Even so, the precision of search results often falls short of what users need. In particular, when a user is looking for specific information in a specific domain, and for the relationships among those pieces of information, the results usually still have to be searched and analyzed by hand.

Based on a survey of users' daily work, this thesis studies two problems: extracting the specific entities that users are interested in, and extracting the relations between those entities. For fixed-format web pages, it analyzes how information is distributed on the page, treats the page source directly as a character stream, and extracts specific entity information by regular expression matching. In addition, based on an analysis of user requirements, it designs and implements a search keyword constructor that combines configurable basic keywords with special keywords and submits the resulting queries to a search engine, in order to obtain more comprehensive search results from non-fixed-format web pages.

For the recognition and extraction of specific entity relations, HTMLParser is used to process pages, extracting the result URLs returned by a general-purpose search engine and the text of the pages those URLs point to. The Chinese Academy of Sciences word segmentation system is used for Chinese word segmentation and part-of-speech tagging, from which person-name entities in the page text are extracted. Regular expressions are used to extract e-mail address entities from the text. Finally, drawing on the association between the pinyin combinations of a Chinese name and the prefix of an e-mail address, a set of predefined extraction rules is applied to extract the relations between the specific entities.

The thesis also designs and implements a working information extraction system with a B/S (browser/server) architecture. The system is developed in Java and consists of three main modules: a user interface module, a specific entity extraction module, and a specific entity relation extraction module. Through the interface module, users can invoke the other two modules to extract information automatically. Compared with users' traditional manual collection and analysis, the system substantially reduces manual labor, shortens the time needed to collect and analyze information, saves labor and material costs, and improves efficiency. It is also quick to deploy and simple to maintain, and it was well received by its users.
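The abstract names three techniques without giving their details: combining basic and special keywords into search queries, extracting e-mail addresses with a regular expression, and relating a Chinese name to an e-mail address through the pinyin of the name. The Java sketch below illustrates one way these steps might look. Everything in it is an assumption made for illustration: the class `EntityRelationSketch`, its method names, the simplified e-mail pattern, and the four prefix rules are hypothetical and are not taken from the thesis. The pinyin syllables are supplied by the caller, since the conversion from Chinese characters to pinyin is outside the scope of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityRelationSketch {

    // Simplified e-mail pattern (an assumption, not the expression used in the thesis).
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    /** Pairs every basic keyword with every special keyword to form one query per pair. */
    static List<String> buildQueries(List<String> basic, List<String> special) {
        List<String> queries = new ArrayList<>();
        for (String b : basic) {
            for (String s : special) {
                queries.add(b + " " + s);
            }
        }
        return queries;
    }

    /** Scans the text as a character stream and returns each distinct address once, lowercased. */
    static List<String> extractEmails(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group().toLowerCase(Locale.ROOT));
        }
        return new ArrayList<>(found);
    }

    /** Hypothetical rules: prefixes built from the pinyin of the surname and given name. */
    static Set<String> candidatePrefixes(String surname, List<String> givenSyllables) {
        StringBuilder full = new StringBuilder();
        StringBuilder initials = new StringBuilder();
        for (String syllable : givenSyllables) {
            if (syllable.isEmpty()) {
                continue; // skip empty syllables instead of failing on charAt(0)
            }
            full.append(syllable);
            initials.append(syllable.charAt(0));
        }
        Set<String> candidates = new LinkedHashSet<>();
        candidates.add(surname + full);      // e.g. zhangxiaoming
        candidates.add(full + surname);      // e.g. xiaomingzhang
        candidates.add(surname + initials);  // e.g. zhangxm
        candidates.add(initials + surname);  // e.g. xmzhang
        return candidates;
    }

    /** True if the letters of the e-mail prefix equal one of the candidate prefixes. */
    static boolean isRelated(String surname, List<String> givenSyllables, String email) {
        int at = email.indexOf('@');
        if (at <= 0) {
            return false; // not a usable address
        }
        // Keep letters only, so "xiaoming.zhang2013" is compared as "xiaomingzhang".
        String prefix = email.substring(0, at).toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
        return candidatePrefixes(surname, givenSyllables).contains(prefix);
    }

    public static void main(String[] args) {
        System.out.println(buildQueries(Arrays.asList("Zhang Xiaoming"),
                                        Arrays.asList("email", "homepage")));
        String page = "Contact: Zhang Xiaoming (zhangxm@example.edu.cn), or admin@example.com.";
        for (String email : extractEmails(page)) {
            boolean related = isRelated("zhang", Arrays.asList("xiao", "ming"), email);
            System.out.println(email + " related=" + related);
        }
    }
}
```

Comparing only the letters of the prefix keeps the rules short, but it also means that digits and separators never help tell two candidates apart. A rule set closer to what the abstract describes would likely need more name patterns and some handling of ambiguous matches.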
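[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3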

[References]