Research on Traffic Term Similarity Comparison Based on Vertical Topic Search
[Abstract]: Calculating the similarity between nouns and the standard terms of a field is the premise and foundation of data mining and natural language processing in professional domains. Term similarity algorithms based on search engine hit counts quantify similarity using the number of hits a search engine returns for term queries. With a large general-purpose search engine, however, the number of hits returned for terms from a specific domain is limited, which often degrades the similarity calculation. This thesis aims to improve term retrieval by building a vertical search engine system for traffic topics, and thereby to improve the precision of term similarity calculation.
The thesis first studies and implements a vertical search engine for the traffic topic. The main task at this stage is to crawl web pages in the traffic field that contain traffic terms. To do so, a topic-specific web crawler for traffic was developed on top of Heritrix, an open-source crawler framework.
Second, the crawled web pages are formatted, redundant information is filtered out, and the index library of the retrieval system is built. The indexing program is written with the open-source Lucene library. It builds an ordered index over the parsed traffic-topic web pages, supports full-text retrieval of traffic terms in the index library, and returns the exact hit count of each term in the index library.
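The role of the index library as a source of hit counts can be illustrated with a minimal sketch. This is not the thesis's Lucene program: `TinyIndex`, `add`, and `hits` are hypothetical names, the index is a plain in-memory inverted index, and pages are assumed to be already parsed and segmented into tokens, each with a unique document id.

```python
from collections import defaultdict


class TinyIndex:
    """Hypothetical in-memory inverted index standing in for a Lucene index."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of document ids
        self.doc_count = 0                # number of indexed documents

    def add(self, doc_id, tokens):
        """Index one parsed page; assumes each doc_id is added only once."""
        self.doc_count += 1
        for tok in tokens:
            self.postings[tok].add(doc_id)

    def hits(self, *terms):
        """Return how many documents contain all of the given terms."""
        if not terms:
            return 0
        doc_sets = [self.postings.get(t, set()) for t in terms]
        return len(set.intersection(*doc_sets))


# Illustrative data only, not taken from the thesis.
index = TinyIndex()
index.add(1, ["highway", "toll"])
index.add(2, ["highway", "bridge"])
index.add(3, ["bridge"])
single_hits = index.hits("highway")           # pages containing "highway"
joint_hits = index.hits("highway", "bridge")  # pages containing both terms
```

Counting documents rather than raw occurrences matches how search engine hit counts are usually reported, which is what hit-count-based similarity measures consume.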
Finally, the Web-PMI algorithm is used to run similarity calculation experiments on standard traffic terms. Within the algorithm, the retrieval query for traffic terms is reconstructed and retrieval operators are added, which reduces ambiguity in the retrieval results, improves their domain relevance, and improves the effectiveness of the algorithm. Based on an analysis of the experimental results, an improved retrieval formula is proposed that increases the number of hits retrieved for terms and eliminates the effect of term coincidence on the similarity calculation.
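The abstract does not give the exact formula or the retrieval operators used, so the following is only a sketch under stated assumptions. It uses one common formulation of WebPMI, log2 of the joint hit probability over the product of the individual hit probabilities, with a co-occurrence threshold below which the score is set to zero. The names `web_pmi` and `build_query`, the `total_docs` and `min_cooccurrence` parameters, the `AND` operator syntax, and the domain keyword are all hypothetical.

```python
import math


def web_pmi(hits_p, hits_q, hits_pq, total_docs, min_cooccurrence=5):
    """PMI-style similarity from hit counts (assumed formulation).

    Returns 0.0 when the terms co-occur too rarely to be informative.
    """
    if hits_p <= 0 or hits_q <= 0 or hits_pq <= min_cooccurrence:
        return 0.0
    prob_p = hits_p / total_docs
    prob_q = hits_q / total_docs
    prob_pq = hits_pq / total_docs
    return math.log2(prob_pq / (prob_p * prob_q))


def build_query(term_a, term_b=None, domain_keyword=None):
    """Build a phrase query, optionally restricted by a domain keyword.

    The quoting and the AND operator are illustrative; a real search
    engine may use different operator syntax.
    """
    parts = [f'"{term_a}"']
    if term_b is not None:
        parts.append(f'"{term_b}"')
    if domain_keyword:
        parts.append(domain_keyword)
    return " AND ".join(parts)


# Illustrative numbers only, not results from the thesis.
query = build_query("highway", "expressway", domain_keyword="traffic")
score = web_pmi(hits_p=100, hits_q=200, hits_pq=50, total_docs=10000)
```

Quoting each term as a phrase and adding a domain keyword is one way to narrow results toward the traffic sense of an ambiguous term. The threshold guards against inflated scores when two terms share only a handful of pages.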
The method proposed in this thesis was applied in the "traffic information consistency detection research" project. The application results show that, compared with the calculation accuracy obtained using the commercial search engine AltaVista, the vertical traffic-topic search engine system built in this thesis performs well in similarity calculation for non-standard terminology in the traffic field. The method is also applicable to term similarity calculation in other specialized fields, and it can effectively support terminology standardization, identification of synonyms and near-synonyms, semantic retrieval, and analogy detection for terminology standards.
[Degree-granting institution]: Chang'an University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1;U11-61