Research on Data-Mining-Based Hot News Discovery and System Methods

Published: 2018-03-28 09:30

Topic: hot events　Approach: text clustering　Source: Hubei University of Technology, 2017 master's thesis

[Abstract]: Online news has become an important source of information for users. As new kinds of web resources and news applications keep emerging, the volume of online news is growing explosively, which makes it harder for users to read the news; discovering and analyzing hot events in this mass of online news has become a pressing problem. Although techniques from machine learning, natural language processing, and related fields are widely used in online hot event discovery, existing text representation models are comparatively limited, so text representation performance still falls short of user expectations and many problems remain open. To understand text more deeply, this thesis builds a clustering-based method for discovering hot Internet events on top of a sentential semantic structure model. The method first analyzes the sentence-level semantic components of each document, computes word weights, and generates semantic vectors. The semantic vectors are then used in a hot event discovery system whose clustering algorithm combines the single-pass clustering idea with agglomerative hierarchical clustering and K-means; event discovery accuracy is 75.2%. The thesis also constructs a simplified event representation method that extracts the key points of an event's development and event tags; the accuracy of the extracted key points is 58.9%. Finally, a prototype system for hot event discovery and simplified event representation is designed and implemented.
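The single-pass idea named in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: cosine similarity as the measure, running-mean centroids, the `threshold` value, and the function names `cosine` and `single_pass`. The sketch also leaves out the thesis's semantic vectors and the agglomerative/K-means refinement; it takes plain numeric vectors as input.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(vectors, threshold=0.8):
    """Process vectors once, in arrival order.

    Each vector joins the most similar existing cluster if the similarity
    reaches `threshold` (an assumed value); otherwise it opens a new cluster.
    Returns a list of clusters, each a list of input indices.
    """
    centroids = []  # running mean vector of each cluster
    members = []    # indices of the vectors assigned to each cluster
    for i, v in enumerate(vectors):
        best, best_sim = -1, -1.0
        for c, centroid in enumerate(centroids):
            sim = cosine(v, centroid)
            if sim > best_sim:
                best, best_sim = c, sim
        if best >= 0 and best_sim >= threshold:
            members[best].append(i)
            n = len(members[best])
            # Update the running mean with the newly added vector.
            centroids[best] = [(m * (n - 1) + x) / n
                               for m, x in zip(centroids[best], v)]
        else:
            centroids.append(list(v))
            members.append([i])
    return members


if __name__ == "__main__":
    docs = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
    print(single_pass(docs, threshold=0.8))
```

Because each document is compared only against the current cluster centroids, the result depends on arrival order and on the threshold, which may be one reason to refine single-pass output with a further clustering stage.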
[Degree-granting institution]: Hubei University of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1

[References]