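Research on the Application of Chinese Word Segmentation in an Agricultural Vertical Search Engine
Topic: Chinese word segmentation  Focus: agricultural vertical search engine  Source: Xinjiang Agricultural University, master's thesis, 2013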
[Abstract]: This thesis first analyzes existing theories and methods of Chinese word segmentation and their main problems, with a focus on the application of statistical models in natural language processing. On this basis, and in view of the special requirements and application environment of agricultural vertical search, it proposes a Chinese word segmentation method that combines a dictionary with a statistical language model. The method builds a segmentation matrix with an improved full-segmentation algorithm, which identifies all types of ambiguity and produces a set of coarse segmentation candidates. An N-gram model then selects the candidate with the highest probability, and out-of-vocabulary (OOV, also called unregistered) words are recognized by a word-position tagging method based on a maximum entropy model to give the final segmentation. Finally, the design and implementation of a prototype Chinese word segmentation system based on this method are presented.
The proposed method improves on earlier approaches in three respects. First, feature characters that act as segmentation markers are identified from a large-scale corpus and collected into a feature-character library; the preprocessed sentence set is first split at these characters, which effectively shortens the strings handled in the coarse segmentation stage. Second, an improved full-segmentation model builds the segmentation matrix through character-position tagging, which effectively detects ambiguity boundaries, identifies all types of ambiguity, and filters out the segmentation candidates that contain ambiguity; a bigram model then computes their probabilities and the best candidate is selected. Third, specialized lexicons of agricultural terms, Chinese personal names, and Chinese organization names are built, word-formation patterns are derived statistically, suitable feature templates are chosen, and sample data are generated.
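The coarse segmentation and bigram selection steps can be illustrated with a minimal sketch. It is not the thesis's improved full-segmentation algorithm (no segmentation matrix or feature-character pre-splitting is modeled); it simply enumerates every dictionary-based segmentation and keeps the one with the highest bigram score. The lexicon, the counts, the add-one smoothing, and all function names are assumptions made for illustration.

```python
import math

# Hypothetical toy lexicon and counts, for illustration only (not from the thesis).
LEXICON = {"水稻", "种子", "公司", "子公司", "种"}
UNIGRAMS = {"<s>": 100, "水稻": 50, "种子": 60, "公司": 80, "子公司": 10, "种": 5}
BIGRAMS = {("<s>", "水稻"): 20, ("水稻", "种子"): 30,
           ("种子", "公司"): 25, ("水稻", "种"): 1}
MAX_WORD_LEN = max(len(w) for w in LEXICON)


def full_segment(text):
    """Enumerate every segmentation built from lexicon words or single characters."""
    if not text:
        return [[]]
    results = []
    for end in range(1, min(len(text), MAX_WORD_LEN) + 1):
        word = text[:end]
        # Single characters are always allowed as a fallback.
        if end == 1 or word in LEXICON:
            for rest in full_segment(text[end:]):
                results.append([word] + rest)
    return results


def bigram_log_prob(words):
    """Log-probability of a word sequence under an add-one-smoothed bigram model."""
    vocab_size = len(UNIGRAMS)  # assumed smoothing vocabulary
    score, prev = 0.0, "<s>"
    for word in words:
        numerator = BIGRAMS.get((prev, word), 0) + 1
        denominator = UNIGRAMS.get(prev, 0) + vocab_size
        score += math.log(numerator / denominator)
        prev = word
    return score


def segment(text):
    """Return the candidate segmentation with the highest bigram score."""
    return max(full_segment(text), key=bigram_log_prob)


candidates = full_segment("水稻种子公司")
best = segment("水稻种子公司")
print("/".join(best))
```

The phrase 水稻种子公司 contains an overlapping ambiguity (种子 + 公司 versus 种 + 子公司), so both readings appear among the candidates and the bigram counts decide between them. Exhaustive enumeration grows quickly with sentence length, which is consistent with the abstract's point that shortening the strings before coarse segmentation is worthwhile.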
Following the word-position tagging approach, a maximum entropy model then recognizes out-of-vocabulary (unregistered) words. Three experiments were designed: a comparison of the performance of the improved full-segmentation algorithm against the traditional full-segmentation algorithm; a comparison of the out-of-vocabulary word recognition rates of the maximum entropy model under different context window widths on a 4-tag word-position tag set; and an overall performance comparison of the prototype system with ICTCLAS, Paoding, and IKAnalyzer. The results show that the prototype system using the proposed segmentation model achieves a recall of 93.6%, a precision of 91.7%, and an F1 measure of 92.6%; for out-of-vocabulary words, recall is 77.2% and precision is 90.1%.
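To make the word-position tagging idea and the context window width concrete, the sketch below converts a segmented sentence into per-character training samples. The abstract does not name the four tags, so the common B/M/E/S convention (begin, middle, end, single) is assumed here; the feature naming, the `#` padding symbol, and all function names are likewise hypothetical. Training the maximum entropy classifier itself is omitted.

```python
TAGS = ("B", "M", "E", "S")  # assumed 4-tag set: Begin, Middle, End, Single


def words_to_tags(words):
    """Convert a segmented sentence into (character, tag) pairs."""
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs


def context_features(chars, index, window):
    """Characters within +/- `window` positions; '#' pads the sentence edges."""
    features = {}
    for offset in range(-window, window + 1):
        pos = index + offset
        features[f"C{offset:+d}"] = chars[pos] if 0 <= pos < len(chars) else "#"
    return features


def make_samples(words, window=2):
    """Build one (features, tag) sample per character of the sentence."""
    pairs = words_to_tags(words)
    chars = [ch for ch, _ in pairs]
    return [(context_features(chars, i, window), tag)
            for i, (_, tag) in enumerate(pairs)]


def tags_to_words(chars, tags):
    """Rebuild words from tags: a word ends at each E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


sentence = ["水稻", "种子", "的", "发芽率"]
samples = make_samples(sentence, window=1)
tags = [tag for _, tag in samples]
print("".join(tags))
```

Changing `window` changes how many neighbouring characters each sample sees, which is the parameter the second experiment varies. Because tags are predicted per character, a word absent from the lexicon can still be recovered by `tags_to_words` when the predicted tags mark its boundaries.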
[Degree-granting institution]: Xinjiang Agricultural University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.1