
Design and Development of a Corpus Phrase Sequence Extraction System

Published: 2018-05-08 12:32

  Topics: corpus-driven + phrase sequences; Source: 《外语电化教学》, 2017, Issue 04


[Abstract]: Extracting phrase sequences from corpora has long been a key technical step in phraseology research. Because of the computational and operational complexity involved, previous studies mostly relied on a single statistical method to measure and extract phrase sequences, which left a large amount of noise in the extracted data. This paper uses current big-data processing and computing techniques to implement a phrase sequence extraction method based on multiple statistical measures, including frequency, mutual information, and boundary entropy, and develops a corresponding system. Experimental results show that the system can process large corpora at the ten-million-word scale on an ordinary computer and significantly improves the quality of the extracted phrase sequences.
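The abstract names three statistics but gives no formulas, so the following is only a minimal sketch of how they are commonly computed, not the authors' system. It assumes pointwise mutual information in its standard form for two-word sequences and takes boundary entropy to mean the Shannon entropy of the words adjacent to a sequence. All function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-token sequence (raw frequency)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def pmi(bigram, unigrams, bigrams):
    """Pointwise mutual information (base 2) of a two-word sequence.

    Assumption: probabilities are simple relative frequencies.
    """
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    p_xy = bigrams[bigram] / total_bi
    p_x = unigrams[(bigram[0],)] / total_uni
    p_y = unigrams[(bigram[1],)] / total_uni
    return math.log2(p_xy / (p_x * p_y))


def boundary_entropy(tokens, seq, side="right"):
    """Shannon entropy (bits) of the words next to `seq` on one side.

    A higher value means the sequence occurs in more varied contexts,
    which suggests it is a self-contained unit.
    """
    seq = tuple(seq)
    n = len(seq)
    neighbours = Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) != seq:
            continue
        j = i + n if side == "right" else i - 1
        if 0 <= j < len(tokens):  # skip occurrences at the text edge
            neighbours[tokens[j]] += 1
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbours.values())


# Toy corpus, invented for illustration only.
tokens = "in terms of cost in terms of time".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
```

Combining the measures is what filters noise: a candidate can be kept only if its frequency, its mutual information, and its boundary entropy each pass a threshold. Any such thresholds would need tuning per corpus and are not specified in the abstract.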
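[Author affiliations]: Beihang University; Logistics Science Research Institute of the Chinese People's Liberation Army; Donghua University
[Funding]: Part of the research output of the National Social Science Fund of China (Grant Nos. 13BYY074; 14CYY049) and the Beijing Social Science Fund (Grant No. 16JDYYA001)
[Classification number]: H314.3; TP311.52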



