Research on a Chinese Patent Infringement Retrieval Model

Published: 2018-05-07 11:47

Topics of this thesis: Chinese patent claims + word segmentation; Source: Beijing University of Technology, master's thesis, 2012

【Abstract】: As society has developed, awareness of intellectual property rights has risen sharply. The number of patent applications has surged as a result, and with it the number of patent infringement cases and patent invalidation rulings. A main cause of these problems is that current information retrieval still needs improvement: recall and precision are low, so not all information relevant to a topic can be surfaced from the massive body of patents and related literature, and results contain a large amount of irrelevant information that seriously interferes with users. Building on a survey of the state of research in information retrieval and patent infringement, this thesis applies text-mining ideas to systematically construct a patent infringement retrieval model for Chinese. Patent infringement retrieval falls into two main types: infringement-avoidance retrieval and active infringement retrieval. Infringement-avoidance retrieval takes the user's own patent (filed or not yet filed), the necessary technical features of a product, or the technical features of a research and development direction, and retrieves granted patents that might be infringed. Active infringement retrieval takes the user's own granted patent and checks whether an identical patent has been granted a second time.
The thesis covers data acquisition and text preprocessing, construction of the patent infringement retrieval model, system implementation, experimental evaluation, and a summary with an outlook on future work. The experimental data consist of invention and utility model patents published by the State Intellectual Property Office of China; a series of processing steps applied to the independent claims of these patents surfaces the patents suspected of infringement. In data acquisition and text preprocessing, patent claims in image format are first converted to plain text with an OCR tool. Second, the character recognition errors and formatting errors introduced by the conversion are catalogued and corrected. Third, a word segmentation algorithm suited to Chinese patent claims is proposed on the basis of the Chinese Academy of Sciences' ICTCLAS segmentation system and applied to the experimental data. Finally, the bibliographic items, patent text, and segmentation results that may be needed are extracted and saved as XML text, forming an XML database. In model construction, an analysis of the principles for determining patent infringement and of the characteristics of patent claims leads to the proposal to compute the coverage of a patent's set of necessary technical features in place of the traditional cosine similarity between text vectors; experiments show the method to be feasible. The thesis also describes the construction of the ontology and of the inverted index. The part on system implementation and evaluation describes the implementation environment, the main technologies used, selected core code, and the experimental results of the algorithms.
The contributions of this thesis are: first, OCR is used to convert PDF files to text files, with fault-tolerant handling; second, word segmentation is tailored to the characteristics of Chinese patent claims, and feature words are used for feature extraction; third, a method for judging patent infringement based on a coverage algorithm over a patent's necessary technical features is proposed.
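The abstract contrasts coverage of a patent's necessary technical features with cosine similarity but does not give the formula. The sketch below is one plausible reading, not the thesis's actual algorithm: coverage is assumed to be the fraction of the protected claim's features that also appear in the candidate. The function names `feature_coverage` and `cosine_similarity` and the sample feature lists are hypothetical, and features are assumed to be already extracted as plain strings.

```python
import math
from collections import Counter


def cosine_similarity(a_terms, b_terms):
    """Bag-of-words cosine similarity between two term lists."""
    a, b = Counter(a_terms), Counter(b_terms)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_coverage(claim_features, candidate_features):
    """Assumed definition: share of the claim's necessary technical
    features that also occur in the candidate's feature set."""
    claim = set(claim_features)
    if not claim:
        return 0.0
    return len(claim & set(candidate_features)) / len(claim)


# Hypothetical data: a short independent claim and a candidate that
# contains every claimed feature plus several unrelated ones.
claim = ["frame", "wheel", "motor"]
candidate = ["frame", "wheel", "motor", "battery",
             "display", "gps", "bluetooth", "lamp"]

print("coverage:", feature_coverage(claim, candidate))
print("cosine:  ", round(cosine_similarity(claim, candidate), 2))
```

On this toy input the candidate contains every claimed feature, so coverage is 1.0, while the extra features pull the cosine similarity down to about 0.61. This asymmetry is one reason a coverage measure might suit infringement screening better than a symmetric similarity score, under the assumptions stated above.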

【Degree-granting institution】: Beijing University of Technology
【Degree level】: Master's
【Year degree conferred】: 2012
【Classification number】: TP391.3;G306

【References】