
Research on Similarity Computation for Science and Technology Projects Based on Hadoop

Published: 2018-02-28 18:31

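  Keywords: science and technology projects; similarity computation; graph model; maximum clique; Hadoop. Source: Hebei University of Technology (河北工业大学), 2015 master's thesis. Document type: degree thesis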


[Abstract]: Since the Outline of the National Medium- and Long-Term Program for Science and Technology Development (2006-2020) took effect, China's public spending on science and technology has grown rapidly, and the management of science and technology projects and funds has steadily improved, providing strong support for scientific and technological development. This growth has also brought new challenges for project management. First, as the number of project applications rises, duplicate applications and duplicate project approvals have become prominent problems. Second, as disciplines become more finely divided and increasingly intersect and merge, broad exchange and cooperation in project research is an important driver of scientific and technological progress, and consolidating projects sensibly according to their similarity is the direction of future development. Strengthening project similarity analysis is the key to both problems. Such analysis generally finds similar projects by computing the similarity of their application documents, which provides a basis for approval decisions. The thesis covers the following main work. First, it analyzes the key techniques of project similarity computation and, to handle the large number of specialized terms in project applications, proposes an improved out-of-vocabulary word recognition method based on a word-sequence-frequency directed network. The method filters the segmented words of an application by part of speech and then filters the extracted out-of-vocabulary words against a stop-word list. The extracted out-of-vocabulary words become part of the feature-word set and, together with the remaining feature words, are used to build an application representation model based on the vector space model and a graph model; application similarity is then computed on this model. Second, it proposes a maximum clique method for computing graph-model similarity: the similarity of two graph models can be obtained from their maximum common subgraph, and the maximum common subgraph problem can in turn be transformed into a maximum clique problem. Finally, as the number of projects grows, the application preprocessing, feature-word extraction, and similarity computation steps become computationally expensive and slow. To address this, the thesis uses the Hadoop distributed computing platform and the MapReduce parallel computing framework to decompose each stage of application similarity computation into Map and Reduce tasks.
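The reduction mentioned in the abstract, from maximum common subgraph to maximum clique, can be illustrated with a small sketch. This is not the thesis's implementation: it is a minimal, single-machine example that assumes unlabeled undirected graphs, builds the modular product graph, and finds a maximum clique with Bron-Kerbosch with pivoting. The function names (`modular_product`, `max_clique`, `graph_similarity`) and the normalization by the larger node count are assumptions made for illustration; the thesis may use feature-word labels and a different similarity formula.

```python
from itertools import combinations


def modular_product(nodes1, edges1, nodes2, edges2):
    """Build the modular product graph; each vertex is a pair (u, v)."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    pairs = [(u, v) for u in nodes1 for v in nodes2]
    adj = {p: set() for p in pairs}
    for (u1, v1), (u2, v2) in combinations(pairs, 2):
        if u1 == u2 or v1 == v2:
            continue  # the node mapping must be one-to-one
        # Two pairs are compatible when both graphs agree on edge / non-edge.
        if (frozenset((u1, u2)) in e1) == (frozenset((v1, v2)) in e2):
            adj[(u1, v1)].add((u2, v2))
            adj[(u2, v2)].add((u1, v1))
    return adj


def max_clique(adj):
    """Bron-Kerbosch with pivoting; returns one maximum clique as a list."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for n in list(p - adj[pivot]):
            expand(r + [n], p & adj[n], x & adj[n])
            p = p - {n}
            x = x | {n}

    expand([], set(adj), set())
    return best


def graph_similarity(nodes1, edges1, nodes2, edges2):
    """Size of the maximum common induced subgraph over the larger graph size.

    The normalization is an illustrative assumption, not taken from the thesis.
    """
    if not nodes1 or not nodes2:
        return 0.0
    clique = max_clique(modular_product(nodes1, edges1, nodes2, edges2))
    return len(clique) / max(len(nodes1), len(nodes2))
```

Each clique in the product graph corresponds to a one-to-one node mapping that preserves both edges and non-edges, so a maximum clique gives a maximum common induced subgraph. For example, comparing a triangle on `a, b, c` with a four-node graph that contains a triangle `1, 2, 3` plus the extra edge `(3, 4)` yields a common subgraph of three nodes and a similarity of 0.75. Both steps are exponential in the worst case, which is consistent with the abstract's motivation for distributing the work.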
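[Degree-granting institution]: Hebei University of Technology (河北工业大学)
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP391.1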




Article ID: 1548449


Article link: https://www.wllwen.com/guanlilunwen/xiangmuguanli/1548449.html

