A Topic-Adaptive Academic Conference Search System
Published: 2018-06-22 02:56
Topics: academic conference search + support vector machine; Source: master's thesis, Huazhong University of Science and Technology, 2013
[Abstract]: According to incomplete statistics, more than 10,000 international academic conferences are held around the world each year, attended by over a million participants, and academic exchange is becoming ever more frequent. Academic conferences are also highly varied and complex in character: some are one-off events, while others belong to a series. Existing academic search engines and digital libraries focus mainly on literature retrieval, so they struggle to meet the urgent demand from a large number of researchers for retrieving information about academic conferences.
Acrost is a topic-adaptive academic conference search system oriented toward CFPs (Calls for Papers). It is characterized by topic-based retrieval and, besides conference retrieval, offers a submission recommendation service as a distinctive feature. To obtain sufficient data, the system uses two approaches: (1) a method based on general-purpose search engines, which saves substantial resource overhead and uses a support vector machine classifier to filter out noise; and (2) a focused crawler based on the vector space model, which crawls academic conference web pages in a targeted manner. Once the raw conference pages have been collected, regular expressions and conditional random fields are applied to semi-structured and unstructured pages, respectively, for information extraction and entity recognition, yielding conference metadata. Lucene is then used to build an inverted index over the metadata. In addition, a topic discovery method based on an incremental hierarchical clustering algorithm is proposed; it parses PDF documents uploaded by users and automatically identifies the topic areas they belong to. The system also includes an academic conference evaluation model based on an academic impact factor, whose indicators include the average number of citations per paper and the paper acceptance rate.
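The vector-space-model step used by the focused crawler can be illustrated with a small relevance check. This is only a minimal sketch under stated assumptions, not the thesis's implementation: the tokenizer, the topic keywords, the `THRESHOLD` value, and the function names `term_vector`, `cosine_similarity`, and `is_relevant` are all hypothetical, and raw term frequencies stand in for whatever term weighting the system actually uses.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a bag-of-words term-frequency vector (hypothetical tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumed topic description for conference (CFP) pages; illustrative only.
TOPIC = term_vector("call for papers conference submission deadline workshop")
THRESHOLD = 0.2  # assumed cut-off, not a value from the thesis


def is_relevant(page_text, threshold=THRESHOLD):
    """Decide whether a fetched page is close enough to the topic to keep."""
    return cosine_similarity(TOPIC, term_vector(page_text)) >= threshold
```

A crawler built this way would score each fetched page against the topic vector and follow links only from pages that pass the threshold, which is what makes the crawl targeted rather than exhaustive.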
The experimental results show that the conference retrieval service of the Acrost system achieves a recall, precision, and F-measure of 84.8%, 90.5%, and 87.6%, respectively, while the submission recommendation service achieves 60.8%, 68.7%, and 64.5%. The Acrost system also responds quickly to user requests. These results indicate that Acrost performs well in both relevance judgment and running speed.
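The reported figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that the F-measure in the abstract is the balanced F1 score, that is, the harmonic mean of precision and recall; the abstract does not state the weighting, so that is an assumption, and the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, in percent.
retrieval = f_measure(precision=90.5, recall=84.8)
recommendation = f_measure(precision=68.7, recall=60.8)

print(round(retrieval, 1), round(recommendation, 1))  # prints: 87.6 64.5
```

Both results match the reported F-measures of 87.6% and 64.5%, which is consistent with the balanced F1 reading.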
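[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3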
Article ID: 2051253
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2051253.html