CRF-Based Extraction of Traffic Information Events from Chinese Microblogs (Weibo)

Published: 2018-09-04 15:22

[Abstract]: Event extraction and tracking has long been an important research direction in natural language processing. A key problem in event extraction is how to accurately and efficiently pull the events of interest out of large volumes of disordered information. The text source chosen for this project is Sina Weibo, a well-known Chinese microblogging service. A microblog (Weibo) is a platform for sharing, spreading, and obtaining information based on user relationships. Users post upwards of a million messages every day. As an emerging medium, Weibo holds a vast amount of information and is an excellent platform for many kinds of big-data research. Posts about urban traffic often mention accidents, traffic jams, and road construction. The information in these posts tends to be highly accurate and timely; through targeted crawling, noise removal, and event extraction, it can serve as a real-time information source covering a city's entire traffic network. However, standard natural language processing tools perform poorly on Chinese Weibo text. This thesis therefore describes a complete system built for the project, covering the pipeline from crawling posts, removing noise, restricting posts to the target topic, sentence segmentation, part-of-speech tagging, and named entity recognition through event extraction to event display. Natural language processing combines a conditional random field (CRF) probabilistic model with rule-based regular expressions, and Python is the main development language. Experimental results show that the best configuration identified in the evaluation extracts event elements from Weibo text with an accuracy of up to 83%, that the text normalization method for Weibo posts effectively improves the accuracy of the subsequent event extraction, and that the finished system can display the extracted information in real time.
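The abstract names the pipeline stages but gives no rules or feature templates, so the following is only a minimal sketch of how the rule-based half (noise removal plus regular-expression extraction) and the input to a CRF tagger might look. Every pattern, keyword list, and function name here (normalize, extract_event, char_features) is a hypothetical illustration and is not taken from the thesis; the CRF training step itself is not shown.

```python
import re

# Hypothetical noise patterns; the thesis does not list its actual rules.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@[\w\-]+")             # @username mentions
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,8}\]")    # emoticon codes such as [衰]

# Assumed keyword lists for three event types named in the abstract.
EVENT_TYPES = {
    "accident": ("事故", "追尾", "相撞"),
    "congestion": ("堵车", "拥堵", "缓行"),
    "roadwork": ("施工", "封路"),
}

# Assumed surface patterns for the time and location elements.
TIME_RE = re.compile(
    r"(?:上午|下午|早上|晚上)?\d{1,2}(?:[:：]\d{2}|点(?:\d{1,2}分|半)?)"
)
LOCATION_RE = re.compile(r"[\u4e00-\u9fff]{2,6}?(?:高架|大道|隧道|大桥|路|街)")


def normalize(text):
    """Strip URLs, mentions, and emoticon codes, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, EMOTICON_RE):
        text = pattern.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_event(text):
    """Return the first time and location match and all matching event types."""
    time_match = TIME_RE.search(text)
    location_match = LOCATION_RE.search(text)
    types = [
        name for name, keywords in EVENT_TYPES.items()
        if any(keyword in text for keyword in keywords)
    ]
    return {
        "time": time_match.group(0) if time_match else None,
        "location": location_match.group(0) if location_match else None,
        "types": types,
    }


def char_features(sentence, i):
    """Build a feature dict for character i, the kind of per-token input
    that CRF toolkits commonly accept (feature choice is an assumption)."""
    return {
        "bias": 1.0,
        "char": sentence[i],
        "is_digit": sentence[i].isdigit(),
        "prev": sentence[i - 1] if i > 0 else "<BOS>",
        "next": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
    }


if __name__ == "__main__":
    post = "17:30 延安高架发生追尾事故，车辆缓行[衰] @上海交通广播 http://t.cn/zXyZ"
    clean = normalize(post)
    print(clean)
    print(extract_event(clean))
```

Rules like these are brittle on their own: a location pattern anchored only on a road suffix can absorb neighbouring characters when the post has no spacing. That fragility is one plausible reason to pair such rules with a statistical sequence labeller, as the abstract describes.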
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.1; TP393.092
