Identification and Extraction of Specific Entity Relations, and the Design and Implementation of the System
Published: 2019-01-27 15:03
[Abstract]: With the advance of Internet technology, the Internet has become an indispensable part of people's work and life. Its greatest advantage is the massive amount of information it makes available to users. That volume, however, also makes information hard to find. Search engines give users a simple and fast way to search: by submitting keywords, a user can retrieve content related to those keywords from the mass of information and obtain links to the pages that contain it. Yet even with a search engine, the precision of the results rarely satisfies users, especially when they are looking for specific information in a specific domain and for the relations among those pieces of information. In such cases they usually have to search through and analyze the engine's results by hand.
Based on a survey of users' daily work, this thesis studies the extraction of specific entities that users are interested in and the extraction of relations between those entities. By analyzing how information is distributed in fixed-format web pages, it treats the page source directly as a character stream and extracts specific entity information with regular expression matching. In addition, based on an analysis of user requirements, a search keyword constructor is designed and implemented. By combining configurable basic keywords and special keywords, it submits different search requests to the search engine to obtain more comprehensive results from non-fixed-format pages. For the identification and extraction of specific entity relations, HTMLParser is used to process pages, extracting the result URLs returned by a general-purpose search engine and the text of the pages those URLs point to. The Chinese Academy of Sciences word segmentation system is used for Chinese word segmentation and part-of-speech tagging, from which the person-name entities in the page text are extracted. Regular expressions are used to extract e-mail entities from the text. Finally, based on how the pinyin combinations of a Chinese name correlate with the prefix of an e-mail address, relations between specific entities are extracted with a set of predefined extraction rules.
The thesis also designs and implements a working information extraction system with a B/S (browser/server) architecture. The system is developed in Java and comprises three main modules: a user interface module, a specific entity extraction module, and a specific entity relation extraction module. Through the interface module, users can invoke the other two modules to extract information automatically.
Compared with users' traditional manual collection and analysis, the system substantially reduces manual labor, shortens the time needed to collect and analyze information, saves labor and material costs, and improves working efficiency. It is also quick to deploy and simple to maintain, and has been well received by its users.
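Three of the steps named in the abstract (combining basic and special keywords, extracting e-mail addresses with a regular expression, and linking a name to an address through its pinyin) can be illustrated with a minimal Java sketch. It is not the thesis's implementation: the class `EntityRelationSketch`, its method names, the simplified e-mail pattern, the pairwise keyword combination, and the candidate-prefix rules are all assumptions made for illustration. Pinyin syllables are passed in directly here, whereas in the described system the names would come from the segmentation and part-of-speech tagging step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and rules below are illustrative assumptions.
final class EntityRelationSketch {

    // Simplified e-mail pattern (assumption; it does not cover every valid address).
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    // Keyword constructor: pair every basic keyword with every special keyword.
    static List<String> buildQueries(List<String> basic, List<String> special) {
        List<String> queries = new ArrayList<>();
        for (String b : basic) {
            for (String s : special) {
                queries.add(b + " " + s);
            }
        }
        return queries;
    }

    // Scan the text as a character stream; return distinct lower-cased addresses in order.
    static List<String> extractEmails(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group().toLowerCase(Locale.ROOT));
        }
        return new ArrayList<>(found);
    }

    // Candidate prefixes for a name given as non-empty lower-case pinyin
    // syllables, surname first, e.g. ["zhang", "wei"] (assumed rule set).
    static Set<String> candidatePrefixes(List<String> pinyin) {
        String surname = pinyin.get(0);
        StringBuilder givenBuf = new StringBuilder();
        StringBuilder initialsBuf = new StringBuilder();
        for (String syllable : pinyin.subList(1, pinyin.size())) {
            givenBuf.append(syllable);
            initialsBuf.append(syllable.charAt(0));
        }
        String given = givenBuf.toString();
        String initials = initialsBuf.toString();
        Set<String> candidates = new LinkedHashSet<>();
        candidates.add(surname + given);    // zhangwei
        candidates.add(given + surname);    // weizhang
        candidates.add(surname + initials); // zhangw
        candidates.add(initials + surname); // wzhang
        return candidates;
    }

    // Relation rule: the address prefix, stripped of digits and separators,
    // must equal one of the candidate prefixes.
    static boolean isLikelyOwner(List<String> pinyin, String email) {
        int at = email.indexOf('@');
        if (at <= 0) {
            return false;
        }
        String local = email.substring(0, at).toLowerCase(Locale.ROOT);
        String normalized = local.replaceAll("[^a-z]", "");
        return candidatePrefixes(pinyin).contains(normalized);
    }

    public static void main(String[] args) {
        String page = "Contact Zhang Wei at zhang.wei2013@example.com or li.na@example.org.";
        List<String> name = Arrays.asList("zhang", "wei");
        System.out.println(buildQueries(Arrays.asList("Zhang Wei"), Arrays.asList("email", "homepage")));
        for (String email : extractEmails(page)) {
            System.out.println(email + " -> " + isLikelyOwner(name, email));
        }
    }
}
```

Exact matching against a small candidate set keeps the rule easy to audit, at the cost of missing addresses built from nicknames or abbreviations. A real rule set would likely need more patterns than the four assumed here.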
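[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3
Article ID: 2416378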
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2416378.html