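Topic Discovery and Trend Analysis of Educational Informatization on Primary and Secondary School Websites
Published: 2018-04-16 14:21
Topics: educational informatization + hot topic discovery; Source: Nanjing Normal University, master's thesis, 2016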
[Abstract]: Educational informatization is an important indicator of how far education has developed in a country or region. With the growth of Internet technology and strong demand for educational informatization, primary and secondary schools in China have set up school websites as platforms for publicity and communication. These sites publish a large volume of frequently updated news reports, and quickly and effectively discovering educational informatization topics in that data, then tracking them over time, is an active research problem. Building on topic discovery, this thesis proposes an educational informatization topic discovery system that can process large amounts of data and mine latent knowledge from the information stream. The system has two main parts: local topic detection and topic discovery. Local topic detection filters for topics related to educational informatization by pattern matching. Topic discovery applies incremental hierarchical clustering to the local topics, representing the latent knowledge as a hierarchy of topics, each of which contains a set of related documents.

The main research work of the thesis is as follows:

1. Collection and storage of large volumes of unstructured data. Web page data is updated frequently and is large in volume. The thesis addresses this by building a Hadoop distributed cluster and carrying out secondary development on the web crawler Nutch. Combining the distributed cluster with Nutch addresses the problem of data collection speed, and the HBase distributed database simplifies the storage of large amounts of unstructured web page data.

2. An information extraction method for primary and secondary school websites. Based on the structural characteristics of these pages, the method combines the open-source toolkit Jsoup, pattern matching, and a line-block distribution function. Jsoup extracts tag information such as title, keywords, and description; pattern matching extracts the publication time; the line-block distribution function extracts the body text. The information extracted from each web page is then wrapped in a Java class.

3. An in-depth study and analysis of the MapReduce distributed programming model. To handle computation over large amounts of data, the TF-IDF formula, cosine similarity, and the clustering algorithm are redesigned to run on the MapReduce programming model, which lays the foundation for the whole topic discovery process.

Finally, experiments are carried out on data from primary and secondary school websites and from Chinese educational informatization websites, and the results are analyzed in terms of the temporal frequency of topics and the trend of changes in topic content. The results show that topics related to educational informatization appear slightly later on primary and secondary school websites than on Chinese educational informatization websites, and that their content is more scattered, but the overall development trend is consistent. This also indicates that the proposed method is effective.
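The TF-IDF weighting, cosine similarity, and clustering named in item 3 of the abstract can be sketched as follows. This is a minimal in-memory Python illustration, not the thesis's MapReduce implementation. The function names, the threshold value, and the sample tokens are hypothetical, and flat single-pass clustering is used here as a simplified stand-in for the incremental hierarchical clustering the thesis describes.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass_cluster(vectors, threshold=0.2):
    """Assign each vector to its most similar cluster, or open a new cluster.

    Simplification (assumption): clusters are flat, not hierarchical.
    """
    clusters = []  # each cluster: {"centroid": dict, "members": [doc indices]}
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": dict(vec), "members": [i]})
        else:
            # Update the centroid as the running mean of its members.
            m = len(best["members"])
            terms = set(best["centroid"]) | set(vec)
            best["centroid"] = {
                t: (best["centroid"].get(t, 0.0) * m + vec.get(t, 0.0)) / (m + 1)
                for t in terms
            }
            best["members"].append(i)
    return [cluster["members"] for cluster in clusters]


# Hypothetical tokenized news items (illustrative only, not thesis data).
docs = [
    ["smart", "classroom", "tablet", "teaching"],
    ["smart", "classroom", "tablet", "pilot"],
    ["sports", "meeting", "relay", "race"],
]
vectors = tf_idf_vectors(docs)
print(single_pass_cluster(vectors))
```

Because documents arrive one at a time and are compared only against existing cluster centroids, new pages can be added without re-clustering everything, which is the property an incremental approach relies on. In a MapReduce setting the term counts and document frequencies would typically be computed in separate map and reduce passes rather than in one function.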
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year conferred]: 2016
[Classification number]: G434
Article ID: 1759294
Article link: https://www.wllwen.com/jiaoyulunwen/jiaoyutizhilunwen/1759294.html