Design and Implementation of a Paper Feature Extraction System Based on Maximum Matching
Published: 2018-12-14 03:40
[Abstract]: In Chinese search engines, word segmentation plays an essential role, and its output directly affects search performance. Current Chinese word segmentation techniques fall into three groups: segmentation by string matching, segmentation by artificial-intelligence methods that rely on understanding the semantics of the text, and segmentation by statistical methods. Chinese word segmentation is the task of splitting a modern Chinese sentence into its words. English separates words with spaces, so it has no segmentation problem. Written Chinese, by convention, places no delimiters between the words of a sentence, so the words must be separated by automatic techniques. Automatic Chinese word segmentation has long been an active research topic in computer science; because of the complexity of the language and the limits of computing technology, it is still under development.
This thesis first analyzes and summarizes existing word segmentation algorithms and discusses the two problems that Chinese segmentation has long struggled to solve: ambiguity resolution and out-of-vocabulary (unregistered) words, which includes the recognition of new words. Future work on Chinese word segmentation needs both to solve these problems in order to reach higher segmentation accuracy, and to extend segmentation to specific domains so that its range of applications keeps growing. By counting the occurrences of each term, the thesis obtains a feature set for the terms and designs a feature extraction method in word-frequency space: the maximum matching algorithm first segments the document into words, the words are then loaded into a word-frequency matrix, the frequency of each entry in the matrix is counted, and finally the text features are extracted.
The thesis focuses on the design and development of a paper feature extraction system for libraries. It combines Chinese word segmentation with feature extraction to build such a system, and it describes the design process and experimental results in detail. According to the author, after the system was deployed, paper management in the school library became more efficient and papers could be found faster.
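The pipeline described in the abstract (maximum-matching segmentation, then word-frequency counting, then feature selection) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not say whether matching runs forward or backward, so forward matching is assumed here; the dictionary `DICTIONARY` is a toy, hypothetical word list; and a `Counter` over a single document stands in for the word-frequency matrix.

```python
from collections import Counter

# Hypothetical toy dictionary; a real system would load a full lexicon.
DICTIONARY = {"中文", "分词", "中文分词", "搜索", "引擎", "搜索引擎", "影响", "性能"}


def forward_max_match(text, dictionary, max_len=None):
    """Segment text greedily, taking the longest dictionary word at each position."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan never stalls.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def extract_features(text, dictionary, k=3):
    """Return the k most frequent multi-character terms as (term, count) pairs."""
    tokens = forward_max_match(text, dictionary)
    # Assumption: single characters are dropped as uninformative features.
    counts = Counter(token for token in tokens if len(token) > 1)
    return counts.most_common(k)


if __name__ == "__main__":
    sentence = "中文分词影响搜索引擎的性能"
    print(forward_max_match(sentence, DICTIONARY))
    print(extract_features(sentence, DICTIONARY))
```

Greedy longest-first matching prefers "搜索引擎" over the shorter entries "搜索" and "引擎". That same greediness is where the segmentation ambiguities mentioned in the abstract come from, and any word missing from the dictionary falls apart into single characters, which is the out-of-vocabulary problem.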
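[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1
Article ID: 2377849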
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2377849.html