Design and Implementation of a Paper Feature Extraction System Based on Maximum Matching

Published: 2018-12-14 03:40

[Abstract]: In a Chinese search engine, word segmentation plays a central role, and its output directly affects the engine's performance. Current Chinese word segmentation techniques fall into three groups: segmentation by string matching, segmentation by artificial-intelligence methods that rely on understanding the semantics of the text, and segmentation by statistical methods. A Chinese word segmentation system splits a modern Chinese sentence into its constituent words. English separates words with spaces, so it has no segmentation problem. Written Chinese, by convention, places no delimiter between the words of a sentence, so the words must be separated by algorithmic means. Automatic Chinese word segmentation has been an active research topic in computer science since the late twentieth century, and because of the complexity of the language and the limits of computing technology it is still under development. This thesis first analyzes and summarizes existing segmentation algorithms and discusses the two problems that Chinese segmentation has long struggled with and that remain its biggest obstacles: ambiguity resolution and out-of-vocabulary (new) words. Future work on Chinese word segmentation must both solve these problems to reach higher segmentation accuracy and develop domain-specific segmentation to widen its range of application. By counting the occurrences of each term, a feature set for that term is obtained, and on this basis a feature extraction method over the term-frequency space is designed. The method first segments a document into words with the maximum matching algorithm, then loads the result into a term-frequency matrix, counts how often each entry of the matrix occurs, and finally extracts the text features. The main subject of the thesis is the design and development of a paper feature extraction system for a library. It combines Chinese word segmentation with feature extraction to build such a system and describes the design process and the experimental results in detail. With the system in use, paper management at the school library became more efficient and papers could be found more quickly.
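The pipeline outlined in the abstract (maximum-matching segmentation, term counting, feature selection) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not say whether forward or reverse matching is used, so forward maximum matching is assumed here; the function names, the sample dictionary, and the "top-k most frequent terms" selection rule are all hypothetical.

```python
from collections import Counter


def forward_max_match(text, dictionary, max_len=None):
    """Segment text greedily, always taking the longest dictionary word
    that starts at the current position (forward maximum matching)."""
    if max_len is None:
        # Longest dictionary entry bounds the candidate window.
        max_len = max((len(w) for w in dictionary), default=1)
    max_len = max(1, max_len)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback, which is
            # how out-of-vocabulary text ends up split character by character.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def term_frequencies(tokens, stop_tokens=frozenset()):
    """Count how often each term occurs, skipping the given stop tokens."""
    return Counter(t for t in tokens if t not in stop_tokens)


def top_features(freqs, k):
    """Assumed selection rule: keep the k most frequent terms as features."""
    return [term for term, _ in freqs.most_common(k)]


# Hypothetical toy dictionary, for illustration only.
SAMPLE_DICT = {"中文", "分词", "中文分词", "搜索", "引擎", "搜索引擎", "影响"}

tokens = forward_max_match("中文分词影响搜索引擎", SAMPLE_DICT)
features = top_features(term_frequencies(tokens), 2)
```

Because the matcher is greedy, "中文分词" is kept as one token even though "中文" and "分词" are also dictionary entries. This greediness is one source of the segmentation ambiguity discussed in the abstract.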
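[Degree-granting institution]: University of Electronic Science and Technology of China (电子科技大学)
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.1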
