
Temporal Keyword Queries on Social Media Data

Published: 2018-08-05 19:33
[Abstract]: Social media services have become some of the most frequently used Internet services in daily life, recording users' original posts, reposts, and comments. As this data accumulates, its long time span makes it valuable for studying collective user behavior and for building a comprehensive understanding of people or events. Because keyword queries are simple and easy to use, they are also used to retrieve relevant information from massive social media data. Users who want to follow an event as it develops repeatedly submit the same query to get the latest news; analysts who want to understand a subject thoroughly need to collect data from different periods. However, existing social media search services and research focus mainly on real-time search, and the publication time recorded with each message is used only to measure freshness.

This thesis uses a social media data stream model to model original content together with its reposts and comments, and defines a reference time series for each social object. Based on this model, a temporal keyword query uses keywords as the content constraint, takes the sum of the time series data within the query time range as the input to the scoring function, and returns the k social objects with the highest scores. By promoting time to a constraint of the query, a single query type serves both real-time tracking and exploratory analysis. The thesis then proposes indexing techniques and query algorithms for this query from two angles: the characteristics of social media data that can be exploited under offline indexing, and the index update efficiency that online indexing must face. Finally, it uses time series data to analyze how information diffusion changed behind the rise and decline of Sina Weibo, and builds an online microblog analysis platform on a real-time social media data stream; both serve as application examples of temporal keyword queries.

The thesis revolves around temporal keyword queries. Its main contributions are threefold:

· A two-level index structure and a segmented-maximum approximate summary designed around the characteristics of social media data. On one hand, the reference trees of social objects follow long-tailed distributions in both size and lifetime. On the other hand, a social object tends to be popular only during certain periods and receives very little attention during the remaining long stretches of time. Based on these two characteristics, the thesis designs a two-level inverted list structure and a segmented-maximum approximate summary, respectively. The two-level inverted list structure manages popular objects and ordinary objects with different index structures; both structures support filtering data along the time dimension and return data in descending order of each social object's final reference tree size. Through a theoretical analysis based on the long-tailed distribution of reference tree sizes, the thesis derives an upper bound on the amount of data that a query algorithm using this index must access. Statistical analysis on real datasets shows that in most cases this upper bound grows sublinearly with k. The thesis further proposes the segmented-maximum approximate summary, which estimates more accurately the upper bound of each object's reference tree size within the query window, thereby avoiding the disk accesses incurred by computing the actual scores of popular objects that are not popular within the query window.

· A log-structured octree index for real-time temporal keyword queries. Another characteristic of social media data is the high rate at which user data is generated, which is especially pronounced during trending events. In the online indexing scenario, therefore, indexing this data quickly and reflecting it in query results promptly matters both for improving the experience of ordinary users and for providing timely data support for fast decision-making. The thesis approximates each social object's reference time series by segments, maps the resulting approximate segments to points in three-dimensional space, and uses an octree to preserve the locality of indexed social objects in both the importance and time dimensions. The encoding of octree nodes lets the index support filtering along the time dimension while also guaranteeing the data return order required by the temporal threshold algorithm. Combining this with a log-structured merge tree takes full advantage of fast memory access and efficient sequential disk reads and writes, enabling fast indexing of social media data.

· Analysis applications built on temporal keyword queries over massive and real-time social media data. Using the complete microblog activity data of 1.7 million users over roughly five years, the thesis analyzes how information diffusion changed behind the rise and decline of Sina Weibo. In this analysis, temporal keyword queries are used to improve the accuracy of the data extraction rules and help cover the data more comprehensively. Starting from a model of the repost time series of a single microblog, the thesis proposes a method that uses a log-Gaussian model to fit the repost model parameters of a group of microblogs, and identifies a statistic related to the speed of information diffusion. It further defines various behavioral features of users on the Sina Weibo platform, as well as external features reflecting the attitude of Internet users at large toward social platforms, analyzes their trends, and explores their relationship with the statistic that reflects information diffusion. Finally, the thesis systematizes its techniques into an online analysis platform built on Sina Weibo's real-time microblog data stream. The platform clusters the results retrieved by temporal keyword queries into topics and presents preliminary statistical analyses of those topics along multiple dimensions.

In summary, the thesis extends the existing keyword query functionality on social media data by proposing temporal keyword queries, and explores index organization and query algorithms from two perspectives: the characteristics of social media data and the update efficiency of the index. The two analysis applications built on this query show that it adapts more flexibly to a variety of application scenarios, helps users uncover important information in social media data, and provides a data foundation for more complex subsequent analysis tasks. The publicly accessible system built at the end of the thesis implements the indexing and analysis techniques described, so that researchers and analysts in various fields can benefit from massive, real-time social media data.
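To make the query definition in the abstract concrete, the following is a minimal Python sketch, not the thesis's implementation. It assumes a toy layout in which each object's reference time series is a list of per-bucket reference counts, that all query keywords must match, and that the scoring function is simply the window sum. The names `SocialObject`, `temporal_topk`, `segment_max_summary`, and `upper_bound` are hypothetical. The summary functions show one simplified reading of a segmented-maximum upper bound, which may differ from the structure the thesis actually uses.

```python
import heapq
from typing import Iterable, List, Set, Tuple


class SocialObject:
    """Toy social object: keywords plus a reference time series.

    series[t] is the number of references (reposts or comments) received
    in time bucket t. This layout is an assumption made for illustration.
    """

    def __init__(self, oid: str, keywords: Iterable[str], series: List[int]):
        self.oid = oid
        self.keywords: Set[str] = set(keywords)
        self.series = list(series)

    def window_sum(self, start: int, end: int) -> int:
        """Sum of references in buckets start..end (inclusive)."""
        lo = max(start, 0)
        hi = min(end, len(self.series) - 1)
        return sum(self.series[lo:hi + 1]) if lo <= hi else 0


def temporal_topk(objects: List[SocialObject], query_keywords: Iterable[str],
                  start: int, end: int, k: int) -> List[Tuple[int, str]]:
    """Brute-force temporal keyword query.

    Keeps objects containing every query keyword (assumed AND semantics),
    scores each by its reference count inside [start, end], and returns
    the k highest (score, oid) pairs in descending order.
    """
    q = set(query_keywords)
    scored = [(o.window_sum(start, end), o.oid)
              for o in objects if q <= o.keywords]
    return heapq.nlargest(k, scored)


def segment_max_summary(series: List[int], seg_len: int) -> List[int]:
    """Keep only the maximum value of each fixed-length segment."""
    return [max(series[i:i + seg_len]) for i in range(0, len(series), seg_len)]


def upper_bound(summary: List[int], seg_len: int, start: int, end: int) -> int:
    """Upper bound on the window sum using only per-segment maxima.

    Each bucket of the window is bounded by the maximum of its segment,
    so the true window sum can never exceed the returned value.
    """
    bound = 0
    for seg, seg_max in enumerate(summary):
        lo, hi = seg * seg_len, (seg + 1) * seg_len - 1
        overlap = min(hi, end) - max(lo, start) + 1
        if overlap > 0:
            bound += seg_max * overlap
    return bound
```

In this sketch, an object whose bound for the query window is already below the current k-th best score can be skipped without reading its full time series, which is the kind of saving the abstract attributes to the summary.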
[Degree-granting institution]: East China Normal University
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP391.3; TP393.09

Article ID: 2166788


Article link: https://www.wllwen.com/guanlilunwen/ydhl/2166788.html

