Design and Implementation of an Automated Web Public Opinion Information Collection System

Published: 2018-04-20 15:39

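Topic: public opinion collection + information extraction; Reference: University of Electronic Science and Technology of China, 2014 master's thesis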

[Abstract]: Public opinion is the aggregate of the views and attitudes that people hold toward events in society. It helps governments maintain social stability, understand existing social problems, and improve their credibility. It is also strategically important for companies, which can use it to learn accurately and promptly what customers think of their products and services, improve the quality of those products and services, and strengthen their overall competitiveness. The rise of Web 2.0 has created major opportunities for the automated collection of Web public opinion information and has also posed new challenges for collection technology. Because Web content is the main carrier of public opinion information, solving the problem of collecting it has become more urgent. Existing research shows that Web public opinion collection must address the mining of massive data, real-time data analysis, and the accuracy of that analysis. This thesis first summarizes the state of research on Web information extraction in China and abroad, then analyzes the existing results in detail, and, based on the needs of a real project, proposes its own method for collecting Web public opinion information. The main research contents are as follows:
1. Existing information collection models and algorithms are studied, and their functions, strengths, and weaknesses are compared and analyzed. The models covered are mainly the understanding model, the object model, and the visual model; the algorithms include ontology-based algorithms and Markov algorithms, among others, giving a fairly comprehensive survey.
2. A visual technique for generating collection templates is studied and proposed. It converts user actions (clicking a next-page hyperlink or button, clicking an element on the page, using a drop-down list, and so on) into collection templates, which makes templates easier and faster to produce.
3. A web page body-text extraction subsystem based on the DOM tree and the line-block distribution function is implemented, using XPath, regular expressions, and related techniques. The system combines statistical and rule-based methods so that it remains generally applicable.
4. Data processing of the collected Web information, such as cluster analysis, is implemented, ultimately providing users with integrated public opinion services such as opinion browsing and hot-topic discovery.
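The line-block (row block) distribution function named in the abstract can be illustrated with a short Python sketch. This is not the thesis's implementation, which is not shown here and which also uses the DOM tree and XPath. It is a simplified variant that strips tags with regular expressions, sums the non-whitespace characters of every window of k consecutive lines, and keeps the run of windows around the largest one. The function names, the window size `k`, and the `threshold` value are all assumptions made for illustration.

```python
import re
from html import unescape


def to_lines(html: str) -> list[str]:
    """Strip markup but keep line breaks, returning one entry per line."""
    # Drop script/style blocks and comments first, since their contents are not text.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    # Remove the remaining tags and decode entities such as &amp;.
    text = re.sub(r"<[^>]+>", "", html)
    return [unescape(line).strip() for line in text.split("\n")]


def block_lengths(lines: list[str], k: int = 3) -> list[int]:
    """Block i = number of non-whitespace characters in lines i .. i+k-1."""
    sizes = [len(re.sub(r"\s+", "", line)) for line in lines]
    return [sum(sizes[i:i + k]) for i in range(len(sizes))]


def extract_body(html: str, k: int = 3, threshold: int = 40) -> str:
    """Return the lines around the densest block (hypothetical helper)."""
    lines = to_lines(html)
    blocks = block_lengths(lines, k)
    if max(blocks) < threshold:
        return ""  # no block is dense enough to look like body text
    peak = blocks.index(max(blocks))
    start = end = peak
    # Grow the region in both directions while neighbouring blocks stay dense.
    while start > 0 and blocks[start - 1] >= threshold:
        start -= 1
    while end + 1 < len(blocks) and blocks[end + 1] >= threshold:
        end += 1
    return "\n".join(line for line in lines[start:end + 1] if line)
```

Calling `extract_body(page_html)` on a fetched page returns the candidate body text, or an empty string when no dense region is found. Because each block spans k lines, a short navigation line directly adjacent to the body can be pulled into the result; a rule-based pass over the output, in the spirit of the statistics-plus-rules combination the abstract describes, would be one way to trim such lines.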
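[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092;TP391.1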

[References]