Research on Similarity Comparison of Transportation Terms Based on Vertical Topic Search

Published: 2018-07-14 15:48

[Abstract]: Computing the similarity between nouns and standard terms within a research field is a prerequisite and foundation for data mining and natural language processing in that field. Web-PMI is an algorithm that computes term similarity from search-engine hit counts: the hit counts a search engine returns for term queries are enough to quantify the similarity of a term pair. However, large general-purpose search engines return too few hits for terms restricted to a specific domain, which often degrades the similarity calculation. This thesis aims to improve term retrieval hits, and thereby the precision of term similarity calculation, by building a vertical search engine system for the transportation domain.
The thesis first studies and implements the construction of a transportation-themed vertical search engine. The main task is to crawl web pages in the transportation domain that contain transportation terms. A transportation-focused crawling program was developed independently within the architecture of the open-source crawler project Heritrix, so that crawling is restricted to the transportation topic.
Second, the crawled pages are parsed, redundant information in them is filtered out, and the index library of the retrieval system is built. The indexing program is written on top of the open-source Lucene library. It builds an ordered index over the parsed transportation pages, supports full-text retrieval of transportation terms in the index library, and returns the exact hit count of each term in the index.
Finally, the Web-PMI algorithm is used in experiments on similarity calculation for standard transportation terms. Within the algorithm, the query expression for transportation terms is reconstructed and retrieval operators are added, which reduces ambiguity in the results, raises their domain relevance, and improves the algorithm's performance. Analysis of the experimental results shows that the improved query expression increases the retrieval hit counts of terms and partly removes the effect of accidental term co-occurrence on the similarity calculation.
The proposed method was applied in the "Research on Consistency Detection of Transportation Information" project. The application results show that the transportation vertical search engine system built in this thesis performs well when computing similarity for rare terms in the transportation domain, and that its calculation accuracy is also slightly higher than that obtained with the commercial search engine Alta Vista. The method is equally applicable to term similarity calculation in other specialized fields, and it can effectively support work such as terminology standardization, identification of synonyms and near-synonyms, semantic retrieval, and analogy detection across terminology standards.
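The abstract does not state the Web-PMI formula, so the following is only a minimal sketch of how a hit-count-based similarity of this kind is commonly computed, assuming the usual pointwise mutual information form log2(P(p, q) / (P(p) P(q))), with each probability estimated as hit count divided by the number of indexed documents. The function names `web_pmi` and `hit_count`, the `threshold` parameter and its default value, and the in-memory document list standing in for the Lucene index are all assumptions made for illustration; they are not taken from the thesis.

```python
import math


def hit_count(docs, *terms):
    """Count documents that contain every given term.

    Hypothetical stand-in for an AND query against the index library;
    a real system would ask the search index for the hit count instead.
    """
    return sum(all(term in doc for term in terms) for doc in docs)


def web_pmi(hits_p, hits_q, hits_pq, total_docs, threshold=5):
    """PMI-style similarity of a term pair computed from hit counts.

    hits_p, hits_q : hit counts for each term queried alone
    hits_pq        : hit count for the conjunctive query (both terms)
    total_docs     : number of documents in the index
    threshold      : assumed cut-off; co-occurrence counts at or below it
                     are treated as accidental and scored 0.0
    """
    if hits_pq <= threshold or hits_p == 0 or hits_q == 0:
        return 0.0
    p_joint = hits_pq / total_docs
    p_p = hits_p / total_docs
    p_q = hits_q / total_docs
    return math.log2(p_joint / (p_p * p_q))


if __name__ == "__main__":
    # Toy corpus; the threshold is set to 0 because the corpus is tiny.
    docs = [
        "road traffic flow",
        "traffic volume on the road",
        "bridge design",
    ]
    score = web_pmi(
        hit_count(docs, "road"),
        hit_count(docs, "traffic"),
        hit_count(docs, "road", "traffic"),
        total_docs=len(docs),
        threshold=0,
    )
    print(score)
```

The threshold mirrors the concern raised in the abstract about accidental co-occurrence: when two terms appear together only a handful of times, the ratio inside the logarithm is unreliable, so the sketch returns zero rather than a noisy score.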
[Degree-granting institution]: Chang'an University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.1;U11-61
