
Spark-Based Social Topic Analysis and Application


Keywords: Spark-based social topic analysis and application. Source: University of Electronic Science and Technology of China, master's thesis, 2016. Thesis type: degree thesis.


Related topics: natural language processing; topic model; Spark; LDA; large-scale data computing


[Abstract]: Natural language processing is regarded as one of the key technologies of the big data era; in particular, text analysis of user-generated content on the Internet carries enormous commercial value. Topic models are a class of unsupervised text processing methods whose development has progressed from the LSI model through the pLSI model to the LDA model. Although topic mining with the LDA model is widely applied in practice, its efficiency drops markedly as data volumes grow, and effective data coverage is hard to reconcile with execution efficiency during processing. With the development of distributed systems, large-scale data computing has come into wide use. The Spark platform, which has risen over the past two years, is widely favored for large-scale machine learning because of its in-memory computing model: keeping intermediate results in cache suits the repeated iterations of machine learning algorithms particularly well, and this lays a foundation for solving the inefficiency of large-scale topic mining. In the LDA model, however, every step of Gibbs sampling depends strongly on the results of the other steps (the standard update is recalled below). If the data are simply split into blocks and processed in parallel, concurrent modification of the same statistics directly breaks variable consistency, while updating the variables asynchronously defeats the purpose of parallelization. Algorithms that depend strongly on per-step execution state are thus hard to parallelize, which is the main reason the machine learning library MLlib still offers so few algorithms despite the rapid growth of the Spark platform, and it makes parallelizing the LDA model genuinely difficult. To address these problems, this thesis exploits the LDA model's assumption that documents and words are independently distributed, together with the dependency pattern of variable updates in Gibbs sampling, and proposes a novel scheme that reduces the impact of inconsistency during LDA parallelization and clearly improves the model's efficiency. The scheme comprises (1) a method for restructuring the original dataset, (2) a method for dividing the execution process into stages, and (3) a strategy for intra-stage computation and inter-stage variable synchronization. Concretely, given a chosen degree of parallelism P and the constructed vocabulary, the dataset is partitioned into blocks, and the blocks are assigned to the P stages of the computation so that each stage selects the P blocks with the least mutual dependence; sampling runs in parallel within a stage, and variables are synchronized between stages (a scheduling sketch follows this abstract). The computation repeats under this scheme until the model converges, yielding the topic distributions. This work effectively resolves the theoretical bottleneck the LDA model meets under parallelization, greatly alleviates variable inconsistency across data blocks in parallel computation, and provides a theoretical basis for parallelizing the LDA model; it also offers a route to parallelizing other algorithms that depend strongly on per-step state. In addition, the thesis implements the parallelized LDA topic model on the Spark platform. On that basis, and taking the characteristics of Sina Weibo text into account, several preprocessing steps improve the model's results: aggregating posts into long documents per user, cleaning retweeted content, and filtering uninformative words with TF-IDF (also sketched below). The outcome is an efficient social topic analysis system whose performance is greatly improved over topic analysis with the standard LDA model, enabling enterprises to mine topics from Weibo social data efficiently; the system further generalizes to data from other social platforms, and its topic output can provide data support in brand marketing scenarios, supporting the sound, data-driven development of brand enterprises.
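For context (the abstract asserts the dependency but does not spell it out), the standard collapsed Gibbs update for LDA makes clear why every sampling step reads state written by all the others:

```latex
% Standard collapsed Gibbs update for LDA. Resampling the topic of token i
% conditions on the assignments of all other tokens z_{-i} through counts:
%   n_{d_i,k}^{-i}: tokens in document d_i assigned to topic k,
%   n_{k,w_i}^{-i}: occurrences of word w_i assigned to topic k,
%   n_k^{-i}:       total tokens assigned to topic k (V = vocabulary size).
P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
  \left( n_{d_i,k}^{-i} + \alpha \right)
  \cdot \frac{n_{k,w_i}^{-i} + \beta}{n_k^{-i} + V\beta}
```

Two workers that concurrently resample tokens of the same document or the same word would read and write the same counts, which is precisely the conflict the staged scheme is designed to avoid.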
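A minimal sketch of the staged block scheduling on Spark follows, written in Scala. All names and the toy data are hypothetical illustrations rather than the thesis's actual code, and a simple per-block count stands in for the real Gibbs sweep; the point is the schedule itself, under which stage s touches only the blocks (i, (i + s) mod P), so no two blocks sampled concurrently share a document partition or a vocabulary partition:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of diagonal stage scheduling for block-parallel LDA.
object StagedLdaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("staged-lda-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val P = 4 // chosen degree of parallelism
    // Toy corpus: (docId, wordId) token pairs standing in for real data.
    val tokens = sc.parallelize(Seq((0, 7), (1, 3), (2, 11), (3, 5), (5, 7)))

    // Restructure the dataset into P x P blocks keyed by
    // (document partition, vocabulary partition).
    val blocks = tokens.map { case (d, w) => ((d % P, w % P), (d, w)) }

    // Stage s processes only the P blocks (i, (i + s) % P): within a stage
    // no two blocks share a document row or a vocabulary column, so parallel
    // sampling never updates the same document or word counts.
    for (s <- 0 until P) {
      val stage = blocks.filter { case ((i, j), _) => j == (i + s) % P }
      // Stand-in for the intra-stage parallel Gibbs sweep: count tokens.
      val swept = stage.countByKey()
      // Inter-stage boundary: global topic counts would be synchronized here.
      println(s"stage $s processed blocks: $swept")
    }
    spark.stop()
  }
}
```

Under this schedule every block is visited exactly once across the P stages, and the only serial points are the P synchronization barriers between stages.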
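Likewise, a sketch of the Weibo preprocessing pipeline, under stated assumptions: the object name and toy posts are hypothetical, the text is assumed already segmented into space-separated tokens (Chinese word segmentation is a separate step), and "//@" is treated as the retweet marker. It uses Spark's DataFrame API with the CountVectorizer and IDF transformers from Spark ML:

```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical sketch of the preprocessing described in the abstract.
object WeiboPreprocessSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("weibo-preprocess-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy posts (user, pre-segmented text); "//@..." marks forwarded content.
    val posts = Seq(
      ("u1", "spark lda topic //@someone: forwarded text"),
      ("u1", "topic model weibo"),
      ("u2", "brand marketing data")
    ).toDF("user", "text")

    // 1. Clean retweets: drop everything after the forward marker "//@".
    val cleaned = posts.withColumn("text", regexp_replace($"text", "//@.*", ""))

    // 2. Aggregate each user's posts into one long document, then tokenize.
    val perUser = cleaned.groupBy("user")
      .agg(concat_ws(" ", collect_list($"text")).as("doc"))
      .withColumn("tokens", split(trim($"doc"), "\\s+"))

    // 3. TF-IDF weighting; words whose weight falls below some threshold
    //    would then be dropped before the documents reach the LDA trainer.
    val cvModel = new CountVectorizer()
      .setInputCol("tokens").setOutputCol("tf").fit(perUser)
    val counted = cvModel.transform(perUser)
    val idfModel = new IDF().setInputCol("tf").setOutputCol("tfidf").fit(counted)
    idfModel.transform(counted).select("user", "tfidf").show(truncate = false)

    spark.stop()
  }
}
```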
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2016
[CLC number]: TP391.1





Link to this article: https://www.wllwen.com/guanlilunwen/yingxiaoguanlilunwen/1387102.html

