Building a Sentiment Lexicon from Stock Prices

Published: 2018-05-19 20:25

Topics: topic model + trend probability model; Source: master's thesis, Southwestern University of Finance and Economics (西南财经大学), 2014

[Abstract]: As the Internet has developed, more and more users have come to rely on it for information, and more and more companies are trying to obtain experience-related information from the web. The Internet has become the "fourth medium" after newspapers, radio, and television, and its convenience has made it people's primary source of information. Meanwhile, social media such as Weibo, WeChat Moments, Facebook, and Twitter let people publish their opinions in large volumes. These opinions matter to companies: they reveal what users think of a company's products and what competitors think of them. Such information can also help cinemas forecast box-office revenue and help people better understand public opinion around them. Sentiment analysis is a technique for accomplishing these tasks. It mainly addresses the question of who holds what opinion about which aspect of what, and it involves a subject (the person), an object (the feature), and an opinion (the sentiment word). Sentiment analysis, also known as opinion finding, extracts subjective information from large volumes of text, for example a person's evaluation of something or a person's opinion on a given viewpoint. Building a sentiment lexicon is an important part of sentiment analysis.
This thesis studies two questions. First, a sentiment lexicon is domain-specific: lexicons for different domains differ markedly, and the same word may carry different sentiment in different lexicons. How can a financial sentiment lexicon be built automatically? Second, the words in a sentiment lexicon do not all carry the same degree of sentiment. How should these sentiment words be ranked?
This thesis combines natural language processing techniques with finance-related techniques to address these questions. It first studies the underlying natural language processing techniques, then builds a system on that theoretical basis, and finally examines through experiments how different parameters affect the sentiment lexicon.
The thesis consists of five chapters:
Chapter 1, Introduction, reviews the state of research on this topic by scholars abroad and sets out the research methods and approach of the thesis.
Chapter 2, Background, introduces commonly used natural language processing techniques, as well as common text classification techniques and their mathematical principles.
Chapter 3, System Implementation, describes the development and implementation of the system, including the overall Lucene-based system development, word segmentation, indexing, and automatic text generation.
Chapter 4, Algorithm and Experiments, presents the Trend-PLSA algorithm, which is based on PLSA. The algorithm fuses trend information with PLSA, combining metadata with a probabilistic graphical model to improve the accuracy of the sentiment lexicon. The chapter closes by describing how different experimental parameters affect lexicon construction.
Chapter 5, Conclusions and Outlook, summarizes the main work and contributions of the thesis and proposes new directions and ideas for future research.
The thesis uses the following techniques:
First, it uses natural language processing, an interdisciplinary field that combines computer science and linguistics and aims to enable machines to understand human language. The techniques involved include TF-IDF weighting, topic models, text vectorization, and index construction.
Second, it combines qualitative and quantitative methods. The object of study is sentiment analysis. Classifying sentiment words is itself a qualitative problem: each given word is assigned to a specified class, that is, its sentiment type is identified. At the same time, the thesis assigns each sentiment word a quantitative value and ranks all sentiment words by it; the larger the absolute value, the stronger the word's sentiment. The stock price information processed here is quantitative data. Using the relevant algorithms, the thesis converts this quantitative data into qualitative information, which is then used to judge sentiment words. In short, combining qualitative and quantitative methods improves both the correctness and the practical usefulness of the sentiment lexicon.
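The two ideas in this paragraph, turning quantitative prices into qualitative labels and ranking words by the absolute value of a signed score, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual algorithm: the function names (`price_trend_labels`, `score_words`, `rank_by_strength`), the `threshold` and `smoothing` parameters, the smoothed log-odds score, the pairing of document i with the price move from day i to day i+1, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def price_trend_labels(closes, threshold=0.0):
    """Map closing prices to qualitative labels (hypothetical rule).

    Label i describes the move from day i to day i+1: 'up' if the
    relative change exceeds threshold, 'down' if it is below
    -threshold, otherwise 'flat'.
    """
    labels = []
    for prev, curr in zip(closes, closes[1:]):
        change = (curr - prev) / prev
        if change > threshold:
            labels.append("up")
        elif change < -threshold:
            labels.append("down")
        else:
            labels.append("flat")
    return labels

def score_words(docs, labels, smoothing=1.0):
    """Signed score per word: smoothed log-odds of 'up' vs 'down' usage."""
    up, down = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        if label == "up":
            up.update(tokens)
        elif label == "down":
            down.update(tokens)
    vocab = set(up) | set(down)
    up_total = sum(up.values()) + smoothing * len(vocab)
    down_total = sum(down.values()) + smoothing * len(vocab)
    return {
        w: math.log((up[w] + smoothing) / up_total)
        - math.log((down[w] + smoothing) / down_total)
        for w in vocab
    }

def rank_by_strength(scores):
    """Order words by absolute score, strongest sentiment first."""
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Toy data (assumed): document i is paired with the move from day i to day i+1.
closes = [10.0, 10.5, 10.2, 10.8, 10.1]
docs = [["profit", "surge"], ["loss", "lawsuit"],
        ["profit", "growth"], ["loss", "warning"]]
labels = price_trend_labels(closes)
ranking = rank_by_strength(score_words(docs, labels))
```

In this sketch the sign of a score gives the qualitative class (positive or negative) and its magnitude gives the rank, which mirrors the qualitative and quantitative split described above. No seed words are supplied; the only supervision signal is the price trend.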
The implementation shows that the proposed sentiment-word generation algorithm is quite practical and that it achieves higher accuracy than other sentiment-word extraction algorithms.
The innovations of this thesis lie mainly in its algorithm and techniques, and can be described as follows.
First, the approach does not require seed words to be chosen in advance. Seed words are words selected beforehand, and conventional methods for generating a sentiment lexicon begin by selecting a number of them. Without good seed words, a sentiment lexicon is no more than flowers in water or the moon in a mirror (an illusion). Good seed words are what guarantee a high-quality sentiment lexicon, and a good lexicon has strong generalization ability. When building a lexicon for a specific domain, choosing seed words requires considerable expertise, and from an economic standpoint hiring experts to select them is expensive. Seed words should also be both general and strongly sentiment-bearing, but these two requirements usually conflict, so the task is not easy even for experts. The algorithm proposed in this thesis is an unsupervised learning algorithm that needs no prior knowledge of any sentiment-related words, that is, no seed words. This greatly reduces the cost of building a sentiment lexicon and speeds up its generation.
Second, the sentiment of words changes over time: new sentiment words keep emerging, and old words acquire new sentiment. Existing algorithms lack the ability to adapt automatically to such changes. The system designed in this thesis continuously retrieves stock price data from the web and automatically matches it with text, so it can keep generating new sentiment words as time passes. A lexicon generated this way is highly timely.
Third, the same word carries different sentiment in different domains, and sentiment words rank differently across domains. The thesis uses a ranking algorithm to rank all the sentiment words.
Finally, the thesis proposes the trend latent semantic analysis algorithm (Trend-PLSA), which builds on the latent semantic analysis algorithm (PLSA). It also experiments with the naive Bayes algorithm and compares the results of naive Bayes with those of the latent semantic analysis algorithm. The results show that, compared with other algorithms, the proposed algorithm makes better use of stock price information, classifies sentiment words more accurately, and builds a better sentiment lexicon.
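For readers unfamiliar with PLSA, the following is a minimal sketch of the standard PLSA EM updates on a toy document-word count matrix. It covers plain PLSA only; the abstract does not specify how Trend-PLSA incorporates the price trend, so that extension is not reproduced here. The function name `plsa`, its parameters, the random initialization, and the toy counts are assumptions made for illustration.

```python
import math
import random

def normalize(row):
    """Scale a list of non-negative numbers so it sums to 1."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)  # fall back to uniform
    return [x / total for x in row]

def plsa(counts, n_topics, n_iter=30, seed=0):
    """Fit plain PLSA with EM (illustrative, not the thesis's code).

    counts[d][w] is the count of word w in document d.
    Returns (p_z_d, p_w_z, history), where p_z_d[d][z] = P(z|d),
    p_w_z[z][w] = P(w|z), and history[t] is the log-likelihood of
    the parameters at the start of iteration t.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    history = []
    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        log_lik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                log_lik += n * math.log(total)
                for z in range(n_topics):
                    resp = n * joint[z] / total
                    new_z_d[d][z] += resp
                    new_w_z[z][w] += resp
        # M-step: renormalize the expected counts.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
        history.append(log_lik)
    return p_z_d, p_w_z, history

# Toy counts (assumed): 4 documents over a 4-word vocabulary.
counts = [[4, 3, 0, 0], [5, 2, 0, 1], [0, 0, 4, 3], [0, 1, 3, 5]]
p_z_d, p_w_z, history = plsa(counts, n_topics=2)
```

EM never decreases the log-likelihood, so `history` should be non-decreasing up to floating-point error. A trend-aware variant would need some way to tie the latent topics to the price labels; how the thesis does this is not stated in the abstract.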
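[Degree-granting institution]: Southwestern University of Finance and Economics (西南财经大学)
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: F832.51;TP391.1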

[Similar literature]