Research on Large-Scale Web-Based Extraction of Chinese Person Information
Published: 2018-07-20 15:17
【Abstract】: People rely more and more on the Internet to retrieve information, and person information is one of the areas they search most. This thesis sets out to extract as much salient person information as possible and to build a person-information knowledge base, usable either as the knowledge base of a people search engine or as the person-related portion of a semantic search engine's knowledge base. The web carries a vast amount of person information, but it arrives in diverse formats and disordered content, heavily mixed with junk, so extracting accurate information from the Internet automatically and efficiently is a complex task with many problems to solve. The thesis studies a complete pipeline from web-page collection, through main-content extraction and Chinese word segmentation, to the structuring of person information; each stage corresponds to one chapter of the thesis, and a brief code sketch of each stage follows the abstract.

The first stage is web-page collection. The thesis details how sources of person-information pages were selected and how the pages were downloaded. Downloading has become increasingly difficult: websites place ever-stricter limits on crawlers and even deploy anti-crawling measures such as rate limits per IP address. The author wrote the download programs and, depending on the target site, used three download modes: ordinary download, proxy-based download, and download of dynamically generated page data.

The second stage is main-content extraction. After surveying related work on extracting the main text of web pages, the thesis adopts an approach based on statistics and the DOM. For each container tag, three statistics are computed: body-text length, hyperlink count, and the number of sentence-ending punctuation marks. The ratios among these values decide whether the tag is a main-content tag, and the main text is then extracted.

The third stage is word segmentation of the extracted text. Common segmentation systems fall short at entity recognition and are therefore poorly suited to knowledge extraction and other natural-language-processing tasks. This thesis uses the segmentation system developed by the 思维与智慧研究所 (Institute of Thinking and Wisdom) at 西南交通大学 (Southwest Jiaotong University), which is markedly better at entity recognition than other segmentation systems. The organization-name recognition algorithm was implemented by the author and is based on word-frequency statistics. The training data were compiled mainly from 百度百科 (Baidu Baike) entries: counting how often entry titles occur in the entry texts yields frequency statistics for the words that make up organization names, and on that basis a mathematical model was built and the recognition algorithm implemented.

The final stage is structuring the person information. Person information on web pages is generally semi-structured or unstructured, so the last part of the pipeline extracts both kinds and stores them as structured records. For semi-structured information, it suffices to match the text against a dictionary of person attributes and, with a few simple rules, read off the attribute values directly; the method is simple and effective. For unstructured information, a rule-based method is used: a trigger lexicon lists the basic person attributes and their corresponding trigger words, and a manually defined rule base specifies how the attribute values are extracted.
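The first sketch below illustrates the three download modes using the `requests` library. The URLs, proxy address, headers, and politeness delay are assumptions for illustration; the thesis does not publish its crawler code, and the dynamic-page tactic shown (requesting a page's JSON data endpoint directly) is one common option rather than necessarily the one the author used.

```python
# A minimal sketch of the three download modes, assuming the `requests`
# library. All URLs, the proxy address, and the delay are illustrative.
import time
import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; person-info-crawler)"}

def fetch_plain(url: str) -> str:
    """Ordinary download: a single GET request."""
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    resp.encoding = resp.apparent_encoding  # many Chinese pages are GBK-encoded
    return resp.text

def fetch_via_proxy(url: str, proxy: str) -> str:
    """Proxy download: route the request through an HTTP proxy so that
    per-IP rate limits fall on the proxy's address instead of ours."""
    proxies = {"http": proxy, "https": proxy}
    resp = requests.get(url, headers=HEADERS, proxies=proxies, timeout=15)
    resp.raise_for_status()
    return resp.text

def fetch_dynamic(api_url: str) -> dict:
    """Dynamic-page download: JavaScript-rendered pages usually load their
    data from a JSON endpoint; requesting that endpoint directly avoids a
    browser. (One common tactic; the thesis does not name its method.)"""
    resp = requests.get(api_url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for url in ["https://example.com/person/1", "https://example.com/person/2"]:
        print(len(fetch_plain(url)))
        time.sleep(2)  # crude politeness delay against per-IP rate limits
```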
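The second sketch shows the statistics-plus-DOM idea for main-content extraction: for each container tag, count body-text length, hyperlinks, and sentence-ending punctuation, then use ratios of these values to pick the main-content node. The scoring formula and thresholds here are assumed for illustration; the thesis does not state its exact cut-offs.

```python
# A sketch of statistics-plus-DOM content extraction with BeautifulSoup.
# The score combines the three statistics named in the abstract; the exact
# formula and the 50-character minimum are assumptions.
from bs4 import BeautifulSoup

CONTAINERS = ["div", "td", "p", "article", "section"]
END_PUNCT = "。！？.!?"

def score(tag) -> float:
    text = tag.get_text("", strip=True)
    n_chars = len(text)
    n_links = len(tag.find_all("a"))
    n_punct = sum(text.count(p) for p in END_PUNCT)
    if n_chars < 50:  # too short to be main content (assumed threshold)
        return 0.0
    link_ratio = n_links / (n_chars / 100 + 1)  # links per ~100 characters
    punct_density = n_punct / n_chars
    # Main content tends to be punctuation-rich prose with few links.
    return n_chars * punct_density / (1.0 + link_ratio)

def extract_main_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for t in soup(["script", "style"]):
        t.decompose()
    # In practice nested containers need tie-breaking; omitted in this sketch.
    candidates = soup.find_all(CONTAINERS)
    best = max(candidates, key=score, default=None)
    return best.get_text("\n", strip=True) if best else ""
```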
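The third sketch is a simplified stand-in for the frequency-based organization-name model: count how often each word occurs as a component of known organization names (as could be compiled from 百度百科 entry titles), then score a candidate word sequence by its components' frequencies. The thesis's actual mathematical model is not reproduced here; the scoring function and suffix bonus are illustrative assumptions.

```python
# An assumed, simplified frequency model for organization-name recognition.
import math
from collections import Counter

def train(org_names_segmented):
    """org_names_segmented: iterable of organization names, each already
    segmented into words, e.g. [["西南", "交通", "大学"], ...]."""
    comp = Counter()    # how often each word appears inside org names
    suffix = Counter()  # how often each word ends an org name (type suffix)
    for words in org_names_segmented:
        comp.update(words)
        suffix[words[-1]] += 1
    return comp, suffix, sum(comp.values())

def org_score(words, comp, suffix, total):
    """Average smoothed log-frequency of the components, with an assumed
    bonus when the last word is a known suffix (大学/公司/研究所-style)."""
    if not words:
        return float("-inf")
    s = sum(math.log((comp[w] + 1) / (total + len(comp))) for w in words) / len(words)
    if suffix[words[-1]] > 0:
        s += 1.0  # assumed bonus weight
    return s

comp, suffix, total = train([["西南", "交通", "大学"],
                             ["中国", "科学院"],
                             ["百度", "公司"]])
print(org_score(["交通", "大学"], comp, suffix, total))  # scores high
print(org_score(["今天", "天气"], comp, suffix, total))  # scores low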
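The fourth sketch illustrates the semi-structured case: match the page text against a person-attribute dictionary and read the value after a separator. The attribute list and the "attribute:value" rule are illustrative assumptions in the spirit of the simple rules the abstract describes.

```python
# A sketch of semi-structured extraction via an attribute dictionary.
# The attributes and the colon rule are assumed for illustration.
import re

ATTRIBUTES = ["姓名", "性别", "民族", "出生日期", "毕业院校", "职业"]

def extract_semistructured(text: str) -> dict:
    info = {}
    for attr in ATTRIBUTES:
        # Simple rule: "attribute [:：] value", value running to end of line.
        m = re.search(re.escape(attr) + r"\s*[:：]\s*([^\n\r]+)", text)
        if m:
            info[attr] = m.group(1).strip()
    return info

sample = "姓名:张三\n民族:汉族\n出生日期:1980年5月\n毕业院校:西南交通大学"
print(extract_semistructured(sample))
```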
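The last sketch illustrates the unstructured case: a trigger lexicon maps attributes to trigger words, and a rule base pairs each trigger with a hand-written extraction pattern, mirroring the trigger-lexicon/rule-base design the abstract describes. The specific triggers and regular expressions are illustrative assumptions.

```python
# A sketch of rule-based unstructured extraction with a trigger lexicon
# and a rule base. Triggers and patterns are assumed examples.
import re

TRIGGERS = {
    "毕业院校": ["毕业于", "就读于"],
    "出生日期": ["出生于", "生于"],
}

RULES = {
    "毕业于": r"毕业于([\u4e00-\u9fa5]{2,15}?(?:大学|学院))",
    "就读于": r"就读于([\u4e00-\u9fa5]{2,15}?(?:大学|学院))",
    "出生于": r"出生于([\d一二三四五六七八九〇年月日]{4,12})",
    "生于":   r"生于([\d一二三四五六七八九〇年月日]{4,12})",
}

def extract_unstructured(text: str) -> dict:
    info = {}
    for attr, triggers in TRIGGERS.items():
        for trig in triggers:
            if trig in text and attr not in info:
                m = re.search(RULES[trig], text)
                if m:
                    info[attr] = m.group(1)
    return info

sample = "张三,生于1980年5月,2003年毕业于西南交通大学。"
print(extract_unstructured(sample))  # finds 毕业院校 and 出生日期
```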
【Degree-granting institution】: 西南交通大学 (Southwest Jiaotong University)
【Degree level】: Master's
【Year conferred】: 2013
【CLC numbers】: TP393.092; TP391.1
Article ID: 2133947
Link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2133947.html