Research on Methods for Extracting Hail Occurrence Information from Sina Weibo
Published: 2018-09-04 17:39
[Abstract]: Hail is a highly destructive form of severe weather that causes great damage, so research on hail is of considerable importance. Studies on hail identification and forecasting already exist, but the accuracy of their predictions has to be verified against hail events that actually occurred. Traditionally, such hail observations are collected solely by specialized meteorological personnel, an approach that is limited in both time and geographic coverage. To collect hail observations more conveniently and quickly, we turn to the Internet. Sina Weibo is the microblogging platform with the largest user base and the highest activity in China, and because hail is a rare, extreme weather event, people tend to post about it on Weibo; we therefore chose Sina Weibo as our data source. Existing methods for collecting Sina Weibo data fall into three groups: methods based on third-party software or third-party Weibo datasets, methods based on Sina's public API, and web crawling. Because this work requires Sina Weibo's advanced search interface, which Sina does not expose publicly, we use a web crawler to fetch the result pages for the configured search conditions and then extract the Weibo posts containing the keyword "冰雹" ("hail"). Not all of the collected posts describe hail that actually occurred. By observation, some posts describe actual hail events, some are weather forecasts stating that hail may occur, and the rest contain no hail event at all. To identify the posts that report actual hail, this thesis applies text classification. Before classification, a sample set covering the three classes was built by manual annotation. The key to text classification is text feature extraction: this thesis describes the main existing text feature methods, adjusts them, and finally combines them, and experiments show that the combination performs better than any single method. The traditional word-only features are then extended so that phrases also serve as classification features. Three classifiers are used (Bayesian, K-nearest neighbor, and support vector machine), and a combined classification scheme based on the three is presented. Test results show that the method extracts 89.5% of the hail events implicit in Sina Weibo posts, with misidentified information below 13.4%. Finally, rule-based template matching is applied at the sentence level to the posts identified as containing hail events in order to extract the time, location, and size of the hail.
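The abstract does not spell out how the three classifiers are combined or which templates drive the final extraction step. As a rough illustration only, the Python sketch below assumes a simple majority vote over the three predicted labels (falling back to one designated classifier when all three disagree) and a handful of made-up sentence-level patterns for time, place, and size. The label names, `combine_votes`, `extract_hail_facts`, the regular expressions, and the toy place list are all hypothetical and are not taken from the thesis.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Sequence

# Hypothetical class labels for the three kinds of posts named in the abstract.
LABELS = ("event", "forecast", "other")


def combine_votes(votes: Sequence[str], fallback_index: int = 2) -> str:
    """Return the strict-majority label, or one classifier's vote if none exists.

    Assumption: majority voting; the thesis may combine classifiers differently.
    """
    label, count = Counter(votes).most_common(1)[0]
    if count * 2 > len(votes):
        return label
    # No strict majority (e.g. all three disagree): trust one chosen classifier.
    return votes[fallback_index]


# Hypothetical templates; a real rule set would be far larger.
TIME_RE = re.compile(
    r"(?:今天|昨天|今晚)?(?:上午|下午|傍晚|晚上)?\d{1,2}[点时](?:半|\d{1,2}分)?"
)
SIZE_RE = re.compile(
    r"(?:鸡蛋|乒乓球|黄豆|核桃)大小?|直径约?\d+(?:\.\d+)?(?:厘米|毫米)"
)
PLACES = ("北京", "天津", "兰州")  # toy gazetteer, for illustration only


def extract_hail_facts(text: str) -> List[Dict[str, Optional[str]]]:
    """Extract time, place, and size from each sentence that mentions hail."""
    facts = []
    # Split into sentences on Chinese and ASCII sentence-ending punctuation.
    for sentence in re.split(r"[。!?!?]", text):
        if "冰雹" not in sentence:
            continue
        time_match = TIME_RE.search(sentence)
        size_match = SIZE_RE.search(sentence)
        # First gazetteer entry (in list order) that occurs in the sentence.
        place = next((p for p in PLACES if p in sentence), None)
        facts.append({
            "time": time_match.group(0) if time_match else None,
            "place": place,
            "size": size_match.group(0) if size_match else None,
        })
    return facts
```

For example, `combine_votes(["event", "event", "other"])` returns `"event"`, and a post classified that way could then be passed to `extract_hail_facts`. Only sentences that contain "冰雹" are inspected, which mirrors the sentence-level extraction described above; any field without a matching template is left as `None`.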
[Degree-granting institution]: Tianjin University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP393.092;TP391.1
Article ID: 2222876
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2222876.html