Research on Similarity Calculation and Clustering Algorithms for Science and Technology Projects

Published: 2018-05-29 02:49

Topic: VSM + semantic understanding; Source: Hangzhou Dianzi University, master's thesis, 2015

[Abstract]: As China's investment in science and technology funding grows, research institutions submit more and more project applications, and preventing the same project from being funded twice has become an important part of science and technology project management. Manual duplicate checking is no longer practical, and existing duplicate-checking systems fall short in both accuracy and speed, so the key technologies behind such systems need further study. This thesis studies representation models, similarity calculation, and clustering for science and technology projects. The main work is as follows:
1. Because science and technology projects are complex in content and carry a large amount of information, the thesis proposes a project knowledge representation model and a project relationship model that combine the matter-element knowledge representation model with the vector space model (VSM), which simplifies the later representation and processing of projects.
2. To meet the duplicate-checking requirement, the thesis reviews similarity calculation methods based on the vector space model and on semantic understanding, and on that basis proposes a VSM similarity calculation method based on semantic understanding. Because project titles are short yet carry much useful information and many technical terms, it also proposes an improved sentence similarity calculation method based on edit distance. The two methods are applied to the main content and to the title of a project respectively, and their results are weighted and combined into an overall project similarity.
3. Duplicate checking requires comparing the project under review with every existing project, which is inefficient, so the thesis clusters the projects first and then checks for duplicates. Existing clustering algorithms either need parameters supplied in advance or have a time complexity too high for large project libraries. The thesis therefore proposes a double-threshold nearest-neighbor project clustering algorithm and applies it to the duplicate-checking system, improving checking speed without reducing checking accuracy.
The similarity calculation methods and the clustering algorithm were applied in the Zhejiang Province science and technology project similarity detection system, where they implemented duplicate checking effectively with good accuracy and running speed, verifying the feasibility of the research.
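The similarity and clustering steps described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation. The content similarity is a plain cosine over term counts, because the abstract does not describe the semantic extension. The title similarity uses the standard Levenshtein distance rather than the improved variant. The clustering is an ordinary single-threshold nearest-neighbor pass standing in for the double-threshold algorithm, whose threshold rules the abstract does not give. All function names, the project dictionary layout with `title` and pre-segmented `tokens` fields, the 0.4 title weight, and the threshold values are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Plain VSM: cosine of two term-count vectors (no semantic expansion)."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def edit_distance(a, b):
    """Standard Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def title_similarity(a, b):
    """Edit distance normalized to [0, 1]; 1.0 means identical titles."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def project_similarity(p, q, title_weight=0.4):
    """Weighted sum of title and content similarity (the weight is assumed)."""
    return (title_weight * title_similarity(p["title"], q["title"])
            + (1.0 - title_weight) * cosine_similarity(p["tokens"], q["tokens"]))


def cluster_projects(projects, join_threshold=0.5):
    """Single-pass nearest-neighbor clustering, a stand-in for the thesis's method.

    A project joins the cluster whose first member is most similar to it,
    provided that similarity reaches join_threshold; otherwise it starts
    a new cluster. No cluster count has to be supplied in advance.
    """
    clusters = []
    for project in projects:
        nearest = max(clusters,
                      key=lambda c: project_similarity(project, c[0]),
                      default=None)
        if (nearest is not None
                and project_similarity(project, nearest[0]) >= join_threshold):
            nearest.append(project)
        else:
            clusters.append([project])
    return clusters


def find_duplicates(query, clusters, duplicate_threshold=0.75):
    """Compare the query only with members of its nearest cluster."""
    if not clusters:
        return []
    nearest = max(clusters, key=lambda c: project_similarity(query, c[0]))
    return [p for p in nearest
            if project_similarity(query, p) >= duplicate_threshold]


if __name__ == "__main__":
    library = [
        {"title": "graph clustering for project review",
         "tokens": ["graph", "clustering", "project", "review"]},
        {"title": "graph clustering for project audit",
         "tokens": ["graph", "clustering", "project", "audit"]},
        {"title": "soil moisture sensors",
         "tokens": ["soil", "moisture", "sensors"]},
    ]
    clusters = cluster_projects(library)
    print([len(c) for c in clusters])
    print([p["title"] for p in find_duplicates(library[0], clusters)])
```

Comparing the query only with the members of its nearest cluster is what saves time over a full scan of the library. The trade-off in this simplified version is that a duplicate assigned to a different cluster would be missed, so the join threshold has to be chosen with care.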
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP391.1

[References]