
Research on Forum Information Extraction and Storage Based on a MongoDB Cloud Storage Platform

Published: 2018-04-26 01:25

  Topics: cloud computing + non-relational databases; Source: Shanghai Jiao Tong University, 2012 master's thesis


[Abstract]: The rapid development of Internet technology and the spread of input terminals such as mobile phones, tablets, and smart TVs have led to explosive growth in Internet data. How to store this massive data more stably and quickly, and how to mine valuable information from it, has become a new challenge for many enterprises. The emergence of cloud storage has brought new opportunities for the rapid development of data mining. Giants such as Amazon, Microsoft, Google, and IBM have launched their own cloud storage platforms, and Chinese companies such as Baidu, Huawei, Tencent, and 360 have also stepped up their efforts in the cloud storage field. Using massive forum data as the storage sample, this thesis builds an experimental system that supports horizontal scaling, designs and implements several methods for extracting forum data, and finally verifies the performance advantage that cloud storage brings. The main work is as follows:
1) The thesis introduces in detail the NoSQL databases whose rise was driven by cloud storage and describes the characteristics of each type. Based on the characteristics of forum data, MongoDB is selected to store the data; it is compared with MySQL, the popular traditional relational database, and some of MongoDB's advantages are summarized. The thesis then describes how MongoDB is used and how forum data is stored in it.
2) The thesis briefly reviews existing methods for extracting forum information, then analyzes the characteristics of Chinese forums and the structure of forums themselves, and divides forums into two categories: general forums and specialized forums. For general forums, regular expressions are used to extract information precisely; for specialized forums, a heuristic extraction method is proposed and designed. Applying a different extraction method to each category improves extraction accuracy.
3) To verify the feasibility of the newly designed storage scheme and of the forum information extraction algorithms, the thesis combines several forum data mining methods to design an experimental forum extraction system based on MongoDB distributed storage. The system supports horizontal scaling, stores massive forum data stably, and accurately mines the various kinds of useful data found in forums. Once the stored data reached a certain scale, the system's capacity to store large volumes of forum data was tested, and storage performance was compared across several kinds of queries. The conclusion is that, for processing big data, cloud storage in a distributed environment has an overwhelming advantage over MongoDB in a single-server architecture.
4) Finally, the thesis summarizes the work and discusses the remaining problems and the outlook for further work.
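As an illustration of the regular-expression approach the abstract describes for general forums, the following is a minimal sketch and not the thesis's actual implementation. The HTML template, the field names, and the `extract_posts` helper are all hypothetical; real forum engines use different markup, so the pattern would need to be adapted for each one.

```python
import re
from typing import Dict, List

# Hypothetical markup for one post in a "general" forum template.
# This is an assumption for illustration, not a real forum engine's output.
POST_PATTERN = re.compile(
    r'<div class="post" id="post_(?P<post_id>\d+)">\s*'
    r'<span class="author">(?P<author>[^<]+)</span>\s*'
    r'<span class="time">(?P<posted_at>[^<]+)</span>\s*'
    r'<div class="content">(?P<content>.*?)</div>\s*'
    r'</div>',
    re.DOTALL,
)

# Matches any inline HTML tag left inside a post body.
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_posts(html: str, thread_url: str) -> List[Dict[str, object]]:
    """Return one dict per post, shaped like a record for a document store."""
    posts = []
    for match in POST_PATTERN.finditer(html):
        posts.append(
            {
                "thread_url": thread_url,
                "post_id": int(match.group("post_id")),
                "author": match.group("author").strip(),
                "posted_at": match.group("posted_at").strip(),
                # Drop inline tags such as <b> from the post body.
                "content": TAG_PATTERN.sub("", match.group("content")).strip(),
            }
        )
    return posts


SAMPLE_HTML = """
<div class="post" id="post_101">
  <span class="author">alice</span>
  <span class="time">2012-01-05 10:30</span>
  <div class="content">Has anyone tried <b>sharding</b> yet?</div>
</div>
<div class="post" id="post_102">
  <span class="author">bob</span>
  <span class="time">2012-01-05 11:02</span>
  <div class="content">Yes, it works well.</div>
</div>
"""

if __name__ == "__main__":
    for post in extract_posts(SAMPLE_HTML, "http://forum.example/thread/1"):
        print(post)
```

Each resulting dict is schema-free and self-describing, which is one reason a document store suits forum data. With a running MongoDB instance, a list like this could be written with PyMongo's `Collection.insert_many`; that step is left out here so the sketch runs without a database.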
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP333; TP311.13




Article ID: 1803870
