
Research and Design of Network Public Opinion Collection Technology Based on Machine Learning

Published: 2018-04-08 09:54

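  Topic: network public opinion  Approach: machine learning  Source: University of Electronic Science and Technology of China (电子科技大学), master's thesis, 2014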


[Abstract]: With the rapid development of Internet technology, online platforms have become increasingly important, and false, violent, and negative public opinion on the network has a growing impact on social stability and national security. Collecting network public opinion effectively is therefore of great significance for preventing the spread of harmful information, maintaining social order, and ensuring public safety. This thesis studies, analyzes, and improves a key technology of network public opinion collection systems, text clustering, and designs and implements a prototype network public opinion collection system. 1. This thesis improves the Single-Pass algorithm for text clustering. Unsupervised text clustering is the core of machine-learning-based network public opinion collection. Although the Single-Pass algorithm performs well at extracting topics from network information, it depends strongly on the order in which texts are input: for the same data set, different input orders may lead to different clustering results. This thesis designs a dual-threshold Single-Pass algorithm that introduces an intermediate state to constrain the drift of cluster center vectors, thereby reducing the dependence on input order. Experiments show that the improvement considerably increases text clustering performance. 2. This thesis improves the DOM-tree-based body text extraction method. The improved method combines the distribution ratios of Chinese characters and non-link text to optimize traditional DOM-tree-based extraction, raising the accuracy of body text extraction in the public opinion collection system. 3. This thesis constructs the architecture of a machine-learning-based network public opinion collection system, designs and implements a prototype system, and tests both its core modules and the system as a whole.
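The dual-threshold Single-Pass idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual rules, so everything below is an assumption made for illustration: the names `single_pass`, `Cluster`, `t_high`, and `t_low`, the threshold values, the use of cosine similarity over raw term counts, and the interpretation of the "intermediate state" as a document that joins a cluster without moving its center vector. Input documents are assumed to be already tokenized and space-separated (Chinese text would need a word segmenter first).

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class Cluster:
    def __init__(self, vec):
        self.centroid = dict(vec)  # mean of the "core" members only
        self.core = 1              # members that were allowed to move the centroid
        self.members = 1           # all members, core and intermediate

    def add(self, vec, update_centroid):
        self.members += 1
        if update_centroid:
            n = self.core
            for t in set(self.centroid) | set(vec):
                old = self.centroid.get(t, 0.0)
                self.centroid[t] = (old * n + vec.get(t, 0.0)) / (n + 1)
            self.core += 1


def single_pass(docs, t_high=0.6, t_low=0.3):
    """One pass over docs; returns (labels, clusters).

    Assumed rule (not taken from the thesis):
      sim >= t_high          -> join the best cluster and update its centroid
      t_low <= sim < t_high  -> intermediate state: join, centroid unchanged
      sim < t_low            -> start a new cluster
    """
    clusters, labels = [], []
    for doc in docs:
        vec = Counter(doc.split())
        best, best_sim = -1, 0.0
        for i, c in enumerate(clusters):
            s = cosine(vec, c.centroid)
            if s > best_sim:
                best, best_sim = i, s
        if best >= 0 and best_sim >= t_low:
            clusters[best].add(vec, update_centroid=best_sim >= t_high)
            labels.append(best)
        else:
            clusters.append(Cluster(vec))
            labels.append(len(clusters) - 1)
    return labels, clusters


docs = [
    "flood rescue city",
    "flood rescue city",
    "flood rescue team",
    "stock market crash",
    "stock market rally",
]
labels, clusters = single_pass(docs)
print(labels)
```

In this sketch, only documents that clear the higher threshold shift a center vector, so a borderline document that happens to arrive early cannot pull the center away from the cluster's core. That is one plausible way to weaken the order dependence the abstract mentions; the thesis may define its intermediate state differently.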
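[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.08; TP181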





