Research and System Implementation of an LDA-Based Mongolian Information Retrieval Method
Published: 2018-05-09 22:33
Topics: Mongolian + LDA topic model; Source: Inner Mongolia Normal University, 2016 master's thesis
[Abstract]: With the continued development of network technology and the globalization of information, people can obtain the information they need from the Internet anytime and anywhere. This convenience has also promoted networked applications of minority languages and scripts, and it plays a positive role both in helping those languages meet the demands of the information age and in the development of search engines. Mongolian is one of the more influential minority languages and scripts in China. As Mongolian content on the Internet grows, quickly and accurately finding the Mongolian information that satisfies a user's need within a large volume of online resources has become an urgent problem for Mongolian information retrieval. Traditional Mongolian retrieval systems rely mostly on keyword matching: they consider only the literal match between words and make little use of the semantic relations between them. In fact, the probability that different users choose the same keyword to describe the same object is often below 20%, and Mongolian is expressively varied, with polysemy (one word, several meanings) and synonymy (several words, one meaning) both common. As a result, query results often differ considerably from what the user needs, and retrieval quality suffers.
To address these problems, this thesis looks for a solution in the topic-level semantics of documents. It uses the LDA topic model to extract the latent topics in documents and the co-occurrence relations among topics, and then uses this latent topic information to improve retrieval. Specifically, the thesis proposes a Mongolian information retrieval method that combines the LDA topic model with a language model. The method first builds unigram and bigram language models over the Mongolian text to obtain the language-model probability distribution of each text. It then builds an LDA topic model, estimates the model parameters by Gibbs sampling, and derives the latent topic probability distribution of each document. Finally, it forms a linear combination of the document's topic distribution and its language-model distribution, uses the combined distribution to compute the similarity between the document and the query keywords, and returns the documents most relevant to the topic of the query. In this method the language model makes full use of Mongolian grammatical features, while the LDA topic model contributes good topic discovery and generalization ability; combining the two gives better topic-level semantic retrieval of Mongolian documents and higher retrieval accuracy.
Experiments were run on a test set drawn from a corpus of primary school Mongolian-language textbooks encoded in the international standard encoding. The results show that, compared with traditional keyword-based retrieval and with retrieval that uses the LDA topic model alone, the proposed method improves both precision and recall, which supports its effectiveness and practicality. On this basis, the thesis also designs and implements an education-oriented information retrieval system for the Mongolian-language textbook corpus. The system is built on a Java Web framework. It supports full-text retrieval over the corpus content as well as database retrieval by fields such as title, edition number, publisher, and education stage. The results page follows traditional Mongolian convention, displaying text in vertical columns ordered from left to right, and it highlights the matching content.
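The linear combination described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so the form used here, P(w|d) = λ·P_LM(w|d) + (1 − λ)·Σ_z P(w|z)·P(z|d), is an assumption, as are the weight `lam`, the additive smoothing constant `mu`, and all function names. For brevity the sketch uses only a unigram language model (the thesis also uses a bigram model), and it assumes the topic-word table `PHI` and the document-topic vectors have already been estimated, for example by Gibbs sampling. The English tokens are hypothetical stand-ins for segmented Mongolian words.

```python
import math
from collections import Counter

def unigram_prob(word, doc_tokens, vocab_size, mu=1.0):
    """Additively smoothed unigram probability P_LM(word | doc)."""
    counts = Counter(doc_tokens)
    return (counts[word] + mu) / (len(doc_tokens) + mu * vocab_size)

def lda_prob(word, theta_d, phi):
    """Topic-model probability: sum over topics z of P(word | z) * P(z | doc)."""
    return sum(p_z * phi_z.get(word, 0.0) for p_z, phi_z in zip(theta_d, phi))

def combined_score(query, doc_tokens, theta_d, phi, vocab_size, lam=0.7):
    """Query log-likelihood under the interpolated document model (assumed form)."""
    score = 0.0
    for w in query:
        p = (lam * unigram_prob(w, doc_tokens, vocab_size)
             + (1.0 - lam) * lda_prob(w, theta_d, phi))
        score += math.log(p)  # p > 0 because the unigram part is smoothed
    return score

# Hypothetical, hand-written model parameters (not taken from the thesis).
PHI = [
    {"school": 0.4, "teacher": 0.3, "book": 0.2, "student": 0.1},  # topic 0
    {"horse": 0.5, "grassland": 0.4, "rider": 0.1},                # topic 1
]
VOCAB_SIZE = 7  # number of distinct words across both topics

# Each document: (tokens, document-topic distribution theta_d).
DOCS = {
    "A": (["school", "teacher", "book"], [0.9, 0.1]),
    "B": (["horse", "grassland", "horse"], [0.1, 0.9]),
}

def rank(query, lam=0.7):
    """Return document ids sorted from most to least relevant."""
    return sorted(
        DOCS,
        key=lambda d: combined_score(query, *DOCS[d], PHI, VOCAB_SIZE, lam=lam),
        reverse=True,
    )

if __name__ == "__main__":
    print(rank(["student"]))
```

In this toy setup the query word "student" appears literally in neither document, so the smoothed unigram model alone scores both documents identically. The topic component separates them: document A is dominated by the topic that assigns probability to "student", so it ranks above document B. This is the kind of gain over pure keyword matching that the abstract attributes to the combined model.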
[Degree-granting institution]: Inner Mongolia Normal University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.3
Article ID: 1867736
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1867736.html