Design and Implementation of a Clustering System Based on Keywords from Chinese Scientific and Technical Literature

Published: 2018-04-22 14:46

Topics of this thesis: web crawler + atomic word; Source: Beijing University of Posts and Telecommunications, 2012 master's thesis

[Abstract]: Information has always mattered to every aspect of human production and daily life, and it matters even more in the information age. With the rapid development of Internet technology, the Internet has become our main channel for obtaining information. However, the volume of information online grows exponentially every day and many kinds of information are intertwined, so retrieving useful information accurately has become a research focus. Cluster analysis is an important technique in natural language processing and an important method for mining the useful information hidden in massive data. For scientific research, on the one hand, papers, journals, and other literature are too numerous to read in full; on the other hand, the widespread use of search engine technology supplies a very large vocabulary that helps us discover information. How to find useful information by clustering existing terms is therefore a topic of practical significance. This thesis first analyzes the difficulty that scientific research faces amid today's information explosion: literature retrieval technology cannot meet the needs of practical applications. It then introduces literature retrieval technology, studies in depth the web crawler, one of its core modules, and examines thoroughly the focused web crawler, currently a key direction of crawler research, including its architecture and key techniques. Second, the thesis briefly introduces clustering techniques in natural language processing and, on that basis, introduces word clustering and concept clustering. After a careful analysis of currently popular word clustering techniques, it adopts word clustering based on atomic concepts to address the problem that the clustering space usually has too many dimensions, which makes clustering too complex. The ultimate goal is to combine web crawling with word clustering and, through atomic-concept-based word clustering over massive data, to ease the difficulty of discovering scientific research hot spots caused by the information explosion. Finally, building on this theoretical study, the thesis designs and implements a web crawler that fetches specified data from a designated website, and uses Chinese word clustering together with the FCM algorithm in MATLAB to implement an atomic-concept-based Chinese word clustering system. The experimental results are analyzed and largely meet expectations.
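The clustering step named in the abstract, fuzzy c-means (FCM) over an atomic-concept feature space, can be illustrated with a minimal sketch. The thesis used the FCM implementation in MATLAB; the pure-Python version below is only a stand-in. The feature scheme (a 0/1 vector marking which atomic words a keyword contains), the atom list, the sample keywords, the deterministic initialization, and all function names are assumptions made for illustration, not details taken from the thesis.

```python
def atomic_vector(keyword, atoms):
    """Assumed feature scheme: 1.0 if the keyword contains the atomic word."""
    return [1.0 if atom in keyword else 0.0 for atom in atoms]


def distance(a, b):
    """Euclidean distance, floored to avoid division by zero."""
    return max(sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5, 1e-12)


def fcm(points, c, m=2.0, iters=100, tol=1e-6):
    """Fuzzy c-means. Returns (centers, memberships).

    memberships[i][j] is the degree to which point i belongs to cluster j;
    each row sums to 1. m > 1 is the fuzzifier.
    """
    n, dim = len(points), len(points[0])
    # Assumption: seed the centers with evenly spaced input points so the
    # run is reproducible (random initialization is the more common choice).
    centers = [list(points[(j * (n - 1)) // max(c - 1, 1)]) for j in range(c)]
    u = [[0.0] * c for _ in range(n)]
    power = 2.0 / (m - 1.0)
    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)).
        new_u = []
        for p in points:
            dists = [distance(p, ctr) for ctr in centers]
            new_u.append([
                1.0 / sum((dists[j] / dists[k]) ** power for k in range(c))
                for j in range(c)
            ])
        change = max(
            abs(a - b)
            for new_row, old_row in zip(new_u, u)
            for a, b in zip(new_row, old_row)
        )
        u = new_u
        # Center update: mean of the points weighted by membership^m.
        for j in range(c):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            centers[j] = [
                sum(w * p[k] for w, p in zip(weights, points)) / total
                for k in range(dim)
            ]
        if change < tol:
            break
    return centers, u


# Hypothetical atomic words and keywords (crawler terms vs. clustering terms).
atoms = ["爬虫", "聚类", "网络", "概念"]
keywords = ["网络爬虫", "聚焦爬虫", "词聚类", "概念聚类"]
points = [atomic_vector(k, atoms) for k in keywords]

centers, u = fcm(points, c=2)
labels = [row.index(max(row)) for row in u]
for keyword, row, label in zip(keywords, u, labels):
    print(keyword, [round(v, 3) for v in row], "-> cluster", label)
```

The feature space here has one dimension per atomic word rather than one per full vocabulary term, which is the sense in which an atomic representation keeps the clustering space small. In this toy run the two crawler keywords fall into one cluster and the two clustering keywords into the other.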
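[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.1;TP311.13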

[References]