Research and Implementation of a Hadoop-Based Chinese Word Collocation Extraction System

Published: 2018-09-01 05:42
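[Abstract (translated from the Chinese)]: A collocation is a recurrent word combination that follows a certain syntactic structure yet is arbitrary and cannot be derived by analogy. Collocation extraction is the automatic extraction of collocations from a corpus by means of computing power and programming languages. With the rapid development of computer technology, automatic collocation extraction has become an increasingly valued natural language processing task. On the one hand, research on collocation extraction plays an important role in many natural language processing applications, such as machine translation, word sense disambiguation, language generation and information retrieval; collocations are also a very important aid to language teaching and second language acquisition. On the other hand, as Internet data and large-scale corpora have become important knowledge sources for collocation research in computational linguistics, the explosive growth of Internet data and the continual expansion of corpora make it particularly important to develop effective methods for automatic collocation extraction. Starting from the trigram data of the Google Research n-gram corpus, and with the aim of automatically extracting typical collocations of Chinese content words, this thesis relies chiefly on key technologies of the Hadoop distributed computing platform, integrates knowledge of Chinese linguistics, and draws on statistical methods to study a distributed word collocation retrieval system based on Java Web and Hadoop, giving users a new, intelligent and convenient way to obtain collocation information.

The main research content is as follows. First, existing statistical collocation extraction methods and the key technologies of the Hadoop distributed platform are described, the strengths and weaknesses of these methods are compared, and the evaluation metrics for collocation extraction are introduced: precision, recall and F value. Second, drawing on Chinese linguistic knowledge and the corpus content, the part-of-speech composition rules between collocated words are analyzed, typical collocation types of Chinese content words are selected, and a part-of-speech description of Chinese content-word collocations is given. Finally, the experimental part gives a concrete implementation for extracting typical Chinese content-word collocations from the n-gram corpus.

The main results are as follows.

(1) Drawing on statistical collocation extraction methods and Hadoop-related technologies, and combining them with the part-of-speech composition rules of Chinese collocations, automatic collocation extraction is given a concrete implementation. Under the MapReduce model, sparse data and non-Chinese data are removed, and the NLPIR Chinese word segmentation system is called for word segmentation and part-of-speech tagging, which completes corpus preprocessing. A span is then chosen to extract the candidate collocation set, the part-of-speech composition rules are used to filter content-word collocations, and statistics are computed with three statistical methods: co-occurrence frequency, mutual information and the chi-square test. The intermediate and final extraction results are stored in the HBase distributed database, and a user dictionary of Chinese word collocations is built.

(2) A front end for the Hadoop-based Chinese word collocation extraction system is developed so that users can obtain collocation information effectively. The front-end pages are designed with the bootstrap framework and provide condition settings in the word search area and display of results.

(3) A method for extracting typical collocations centered on content words is summarized. This method, which combines big data technology, linguistic knowledge and statistical methods, is applied to collocation extraction experiments on four classes of content words: nouns, verbs, adjectives and adverbs. Quantitative comparison shows that extraction based on co-occurrence frequency gives the best results. For noun collocations, precision is 86%, recall is 59.72% and the F value is 70.49%. For verb collocations, precision is 80%, recall is 65.57% and the F value is 72.07%. For adjective collocations, precision is 82%, recall is 78.85% and the F value is 80.39%. For adverb collocations, precision is 88%, recall is 43.56% and the F value is 58.28%. Precision for adjective and noun collocations is 2% to 4% higher than that of existing collocation extraction software, which indicates that the method has some value for the automatic extraction of Chinese collocations.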
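
The abstract names three statistics (co-occurrence frequency, mutual information and the chi-square test) but does not give their formulas. As a minimal sketch, assuming the standard pointwise mutual information and the chi-square statistic of a 2x2 contingency table built from pair and word counts, the scores for one candidate pair could be computed as below. The function name `collocation_scores`, its parameters and the example counts are hypothetical and are not taken from the thesis.

```python
import math


def collocation_scores(pair_count, left_count, right_count, total):
    """Return (co-occurrence frequency, PMI, chi-square) for one word pair.

    pair_count  -- times the two words co-occur within the chosen span
    left_count  -- occurrences of the first word
    right_count -- occurrences of the second word
    total       -- total number of observations in the corpus sample
    All formulas are the textbook ones, assumed here for illustration.
    """
    if pair_count <= 0:
        raise ValueError("pair_count must be positive to take a logarithm")

    # Pointwise mutual information: log2( P(x, y) / (P(x) * P(y)) )
    pmi = math.log2(pair_count * total / (left_count * right_count))

    # Observed cells of the 2x2 contingency table
    o11 = pair_count                                       # both words
    o12 = left_count - pair_count                          # first word only
    o21 = right_count - pair_count                         # second word only
    o22 = total - left_count - right_count + pair_count    # neither word

    # Chi-square statistic for a 2x2 table
    numerator = total * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    chi2 = numerator / denominator

    return pair_count, pmi, chi2


# Hypothetical counts: the pair occurs 10 times, each word 20 times, in 100 observations.
freq, pmi, chi2 = collocation_scores(10, 20, 20, 100)
print(freq, round(pmi, 4), round(chi2, 4))
```

When the two words are statistically independent (for example a pair count of 4 with the same word counts), both the mutual information and the chi-square value come out as zero, which is why higher scores are read as evidence of a collocation.
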
[Abstract]: A collocation is a recurrent combination of words that follows a certain syntactic structure but is arbitrary and cannot be derived by analogy. Collocation extraction refers to the automatic extraction of collocations from a corpus using computing power and programming languages. With the rapid development of computer technology, automatic collocation extraction has become an increasingly important natural language processing task. On the one hand, collocation extraction plays an important role in many applications of natural language processing, such as machine translation, word sense disambiguation, language generation and information retrieval; in addition, collocations play an important supporting role in language teaching and second language acquisition. On the other hand, Internet data and large-scale corpora have become important sources of knowledge for collocation research in computational linguistics, and the explosive growth of Internet data and the continuous expansion of corpus size make it particularly important to develop effective methods for automatic collocation extraction. Starting from the trigram data of the Google Research n-gram corpus, and aiming to extract typical collocations of Chinese content (notional) words automatically, this thesis studies a distributed word collocation retrieval system based on Java Web and Hadoop, taking the key technologies of the Hadoop distributed computing platform as the leading factor, integrating knowledge of Chinese linguistics and drawing on statistical methods. The system provides users with a new, intelligent and convenient way to obtain collocation information. The research contents are as follows. First, existing statistical collocation extraction methods and the key technologies of the Hadoop distributed platform are described, the advantages and disadvantages of these methods are compared and analyzed, and the evaluation metrics of collocation extraction are introduced: precision, recall and F value. Second, combining Chinese linguistic knowledge with the corpus content, the thesis analyzes the part-of-speech composition rules between collocated words, selects the typical collocation types of Chinese content words, and describes the part-of-speech composition of Chinese content-word collocations. Finally, the experimental part gives a concrete implementation for extracting typical Chinese content-word collocations from the n-gram corpus. The main results are as follows. (1) Drawing on statistical collocation extraction methods and Hadoop-related technologies, together with the part-of-speech composition rules of Chinese collocations, automatic collocation extraction is implemented concretely. Sparse data and non-Chinese data are removed under the MapReduce model, and the NLPIR Chinese word segmentation system is called for word segmentation and part-of-speech tagging to complete corpus preprocessing; a span is selected to extract the candidate collocation set, the part-of-speech composition rules are used to filter content-word collocations, and statistics are calculated with three statistical methods: co-occurrence frequency, mutual information and the chi-square test. The intermediate and final results are stored in the HBase distributed database, and a user dictionary of Chinese word collocations is constructed. (2) A front end for the Hadoop-based Chinese word collocation extraction system is developed so that users can obtain collocation information effectively. The front-end pages are designed with the bootstrap development framework, and the functions of setting conditions in the word retrieval area and displaying results are implemented. (3) A method for extracting typical collocations centered on content words is summarized. This method, which combines big data technology, linguistic knowledge and statistical methods, is applied to collocation extraction experiments on four classes of content words: nouns, verbs, adjectives and adverbs. Quantitative comparative analysis shows that collocation extraction based on co-occurrence frequency gives the best results. For noun collocations the precision is 86%, the recall is 59.72% and the F value is 70.49%; for verb collocations the precision is 80%, the recall is 65.57% and the F value is 72.07%; for adjective collocations the precision is 82%, the recall is 78.85% and the F value is 80.39%; for adverb collocations the precision is 88%, the recall is 43.56% and the F value is 58.28%. The precision of adjective and noun collocation extraction is 2% to 4% higher than that of existing collocation extraction software, which indicates that the method has some value for the automatic extraction of Chinese collocations.
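
The F values reported above are consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below assumes that definition (the thesis's own formula is not shown in this abstract) and recomputes the four values from the rounded percentages; `f_value` is a hypothetical helper name.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F %) as stated in the abstract
reported = {
    "noun": (86, 59.72, 70.49),
    "verb": (80, 65.57, 72.07),
    "adjective": (82, 78.85, 80.39),
    "adverb": (88, 43.56, 58.28),
}

for pos, (p, r, f) in reported.items():
    print(f"{pos}: computed {f_value(p, r):.2f}, reported {f:.2f}")
```

The recomputed values match the reported ones to within about 0.01 percentage points; a residual difference of that size is what one would expect when recall has already been rounded to two decimals.
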
[Degree-granting institution]: Yangtze University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1

Article No.: 2216281

Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2216281.html