Research and Implementation of a MapReduce-Based Web Text Mining System

Published: 2018-02-28 19:14

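Keywords: Web mining; MapReduce; MongoDB; social network analysis; named entity　Source: Beijing University of Posts and Telecommunications, 2013 master's thesis　Document type: degree thesis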

[Abstract]: As Internet media has matured, more and more media information is published and transmitted through this fast, inexpensive channel. The volume of information on the network is enormous and, as Internet applications deepen, it is growing at a striking rate. Search engines can help us retrieve reasonably accurate, relevant web pages, but the information obtained is preliminary and broad. It does not reveal the intrinsic relationships among these pieces of information or their entity models, so further analysis and processing are still required. One option is to borrow general network analysis methods and apply relationship mining and model analysis to heterogeneous web information once it has been converted into entities, in order to uncover latent, valuable knowledge. This thesis studies the MongoDB distributed database and the Hadoop distributed computing framework and, based on MongoDB data modeling and Hadoop MapReduce computation, designs an efficient scheme for analyzing entities in Web news. The specific work includes:
1. Using an XML-based parsing method, the Web news data from Sogou Labs is analyzed as semi-structured data and the relevant information is extracted. Within the MapReduce framework, the text content is segmented into words, keyword weights are computed with the TF-IDF algorithm, and feature expressions for the texts are extracted.
2. Based on MongoDB's data model and parallel processing, and combined with relational network analysis algorithms, a degree centrality algorithm is used to analyze how central each individual entity node is in the entity relationship network, so as to mine the core of news topics. Cohesive subgroup analysis is used to find small groups whose members are closely connected, and a block model among entities is constructed.
3. Using the document-oriented non-relational database MongoDB and its strong modeling capability, a data model that can describe text features is designed. Combined with Hadoop's MapReduce parallel computing framework and under a J2EE architecture, a distributed storage and computing platform for Web news is designed and built, and the analysis results are displayed using JUNG.
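The abstract does not give implementation details, so the following is only a minimal in-memory sketch of two techniques it names: TF-IDF keyword weighting organized as map and reduce phases, and degree centrality on an entity network. It does not use Hadoop or MongoDB. The function names, the assumption that documents are already segmented into tokens, and the choice to link two entities when they co-occur in the same document are all illustrative assumptions, not the thesis's actual design.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def map_term_counts(doc_id, tokens):
    """Map phase (hypothetical): emit ((term, doc_id), count) for one document."""
    for term, count in Counter(tokens).items():
        yield (term, doc_id), count


def tf_idf(docs):
    """Compute TF-IDF weights.

    docs: {doc_id: [token, ...]}, assumed to be already word-segmented.
    Returns {doc_id: {term: weight}} with tf = count / doc length and
    idf = ln(N / document frequency).
    """
    # Shuffle step: group the mapper output by term.
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for (term, d), count in map_term_counts(doc_id, tokens):
            postings[term][d] = count

    n_docs = len(docs)
    weights = defaultdict(dict)
    # Reduce phase: one call per term, which sees every document containing it.
    for term, per_doc in postings.items():
        idf = math.log(n_docs / len(per_doc))
        for d, count in per_doc.items():
            tf = count / len(docs[d])
            weights[d][term] = tf * idf
    return dict(weights)


def degree_centrality(doc_entities):
    """Normalized degree centrality on an entity co-occurrence network.

    doc_entities: {doc_id: iterable of entity names}.
    Assumption: two entities are linked if they appear in the same document.
    Entities that never co-occur with another entity are left out.
    """
    neighbors = defaultdict(set)
    for entities in doc_entities.values():
        for a, b in combinations(sorted(set(entities)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    # Degree divided by the maximum possible degree, n - 1.
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}
```

In a real Hadoop job the shuffle step is performed by the framework rather than by a dictionary, and the document count and document lengths would have to be supplied to the reducers, for example from an earlier counting pass.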
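[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1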
