Design and Development of a Corpus Phrase Sequence Extraction System
Posted: 2018-05-08 12:32
Topics: corpus-driven + phrase sequences; Source: 《外语电化教学》, 2017, Issue 04
[Abstract]: Extracting phrase sequences from corpora has long been a key technical step in phraseology research. Because of the computational and operational complexity involved, previous studies have mostly relied on a single statistical method to measure and extract phrase sequences, so the extracted data contain a large amount of noise. Using current big-data processing and computing techniques, this paper implements a phrase sequence extraction method that combines several statistical measures, including frequency, mutual information, and boundary entropy, and develops a corresponding system. Experimental results show that the system can process large corpora on the scale of ten million words on an ordinary computer and can significantly improve the quality of the extracted phrase sequences.
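The abstract names three statistical measures (frequency, mutual information, boundary entropy) but gives no formulas. The sketch below illustrates the textbook definitions of these measures on a toy token list. It is an assumption about how such measures are commonly computed, not the authors' implementation; the function names `ngrams`, `pmi`, and `boundary_entropy` and the sample `tokens` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pmi(bigram, unigram_counts, bigram_counts):
    """Pointwise mutual information (in bits) of a two-word sequence.

    Probabilities are simple relative frequencies (an assumption;
    the paper may normalise differently).
    """
    n_uni = sum(unigram_counts.values())
    n_bi = sum(bigram_counts.values())
    p_xy = bigram_counts[bigram] / n_bi
    p_x = unigram_counts[bigram[0]] / n_uni
    p_y = unigram_counts[bigram[1]] / n_uni
    return math.log2(p_xy / (p_x * p_y))


def _entropy(counter):
    """Shannon entropy (in bits) of a Counter of neighbour words."""
    total = sum(counter.values())
    return sum(c / total * math.log2(total / c) for c in counter.values())


def boundary_entropy(tokens, n):
    """Map each n-gram to (left entropy, right entropy) of its neighbours.

    High entropy on both sides suggests the n-gram occurs in varied
    contexts and is a self-contained unit; entropy near zero on one
    side suggests it is a fragment of a longer sequence.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if i > 0:
            left[gram][tokens[i - 1]] += 1
        if i + n < len(tokens):
            right[gram][tokens[i + n]] += 1
    grams = set(left) | set(right)
    return {g: (_entropy(left[g]), _entropy(right[g])) for g in grams}


# Toy corpus (hypothetical data, for illustration only).
tokens = ("we talk in terms of cost they think in terms of time "
          "you act in terms of risk").split()

unigram_counts = Counter(tokens)            # raw frequency of each word
bigram_counts = Counter(ngrams(tokens, 2))  # raw frequency of each bigram
```

On this toy input, the bigram ("in", "terms") is frequent and has a high PMI, yet its right boundary entropy is 0 bits because it is always followed by "of". The trigram ("in", "terms", "of") has log2(3) ≈ 1.585 bits on both sides. This shows how combining measures can filter out fragments that a single frequency or association score would accept.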
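[Affiliations]: Beihang University; Logistics Science Research Institute of the Chinese People's Liberation Army; Donghua University
[Funding]: Partial research results of National Social Science Fund of China projects (Nos. 13BYY074; 14CYY049) and a Beijing Social Science Fund project (No. 16JDYYA001)
[Classification]: H314.3; TP311.52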
Article ID: 1861420
Article link: https://www.wllwen.com/waiyulunwen/yingyulunwen/1861420.html