
Research on a Lucene Search Engine Based on a PSO-BP Neural Network

Published: 2019-02-23 09:37
[Abstract]: Lucene is a full-text search framework. Its strengths include an excellent index structure, a sound system architecture, and a high-performance, scalable information retrieval library, but its support for Chinese word segmentation and for multiple text formats is weak. Many Chinese word segmentation algorithms are currently used with Lucene, including StandardAnalyzer and CJKAnalyzer, which ship with Lucene, and third-party analyzers such as ChineseAnalyzer and IK_CAnalyzer. StandardAnalyzer performs single-character segmentation: Chinese text is split character by character, which has the drawback of requiring a complex single-character matching algorithm and a large amount of CPU computation. CJKAnalyzer and ChineseAnalyzer both use bigram segmentation, in which every two adjacent characters are treated as one word. IK_CAnalyzer is dictionary-based and uses its own forward iterative finest-granularity segmentation algorithm together with a multi-sub-processor analysis mode. Lucene does not yet implement understanding-based Chinese word segmentation: because a computer cannot recognize the meaning of each word in different contexts, understanding-based methods have not yet shown practical results.

To address Lucene's shortcomings in Chinese word segmentation, in particular the lack of understanding-based segmentation, this thesis studies the application of the BP (Back Propagation) neural network algorithm to Chinese word segmentation. Because a BP neural network applied to Chinese word segmentation converges slowly, easily falls into local minima, and is slow and inefficient, the thesis proposes optimizing the BP neural network with an improved particle swarm optimization (PSO, Particle Swarm Optimization) algorithm, yielding the PSO-BP neural network, and applies it to Chinese word segmentation. Compared with the traditional BP neural network, the PSO-BP neural network not only overcomes the slow convergence of the traditional network but also improves segmentation accuracy.

Next, the thesis systematically studies and analyzes the API of the third-party Chinese word segmentation components available for Lucene, successfully integrates the PSO-BP-optimized segmentation technique into Lucene, and compares it with Lucene's built-in Chinese word segmentation. The results show that it clearly outperforms the built-in techniques.

Finally, the thesis designs and implements a search engine using Lucene with the PSO-BP neural network Chinese word segmentation component. This is an exploratory step toward intelligent Chinese word segmentation in search engines and provides a good platform for subsequent work and research.
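The single-character and bigram segmentation schemes that the abstract contrasts can be illustrated with a minimal Python sketch. The function names `unigram_tokens` and `bigram_tokens` and the sample string are illustrative assumptions; they are not taken from the thesis and are not Lucene API calls. The bigram variant assumes overlapping pairs of adjacent characters, which is one common reading of "every two characters as one word".

```python
from typing import List


def unigram_tokens(text: str) -> List[str]:
    """Single-character segmentation: each non-space character is one token."""
    return [ch for ch in text if not ch.isspace()]


def bigram_tokens(text: str) -> List[str]:
    """Bigram segmentation (assumed overlapping): each adjacent pair is one token."""
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        # A lone character cannot form a pair, so emit it unchanged.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


if __name__ == "__main__":
    sample = "中文分词技术"  # hypothetical sample text
    print(unigram_tokens(sample))  # ['中', '文', '分', '词', '技', '术']
    print(bigram_tokens(sample))   # ['中文', '文分', '分词', '词技', '技术']
```

Neither scheme consults a dictionary or context, so both produce tokens such as "文分" or "词技" that are not real words. This is the gap that dictionary-based and understanding-based segmentation aim to close.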
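[Degree-granting institution]: China University of Petroleum (East China)
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3;TP183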




Article ID: 2428689


Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2428689.html

