Research on a Hadoop-Based Evaluation System for Traditional Chinese Medicine Web Information Resources

Published: 2018-03-06 09:30

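Topic: Traditional Chinese medicine   Focus: Web   Source: Shandong University of Traditional Chinese Medicine, 2016 doctoral dissertation   Type: Degree thesis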

[Abstract]: With the development of computer and communication technology, the Internet has gradually penetrated every area of production and daily life and has become an important source of knowledge. People continually obtain information online to guide their work and lives, and modern society can no longer do without the Internet. The Web refers to the part of the Internet associated with HTML, that is, information resource pages built on HTML. Traditional Chinese medicine (TCM) information resources on the Web grow every day, and existing resources are constantly changed and updated. The rapid development of information technology has caused data related to TCM information resources on the Web to grow explosively, but the quality of this growing body of information is uneven. At present there is hardly a relatively complete method for objectively evaluating the quality of TCM information resources and for guiding people to find correct, useful information among the large volume of such resources. A method is therefore needed for objectively evaluating the TCM information resources currently on the Web. Starting from the characteristics of Web TCM information resources and using Hadoop distributed computing technology, this dissertation proposes building an evaluation index system for TCM Web information resources based on a data-assisted Delphi method and AHP (Analytic Hierarchy Process), and conducts an empirical study of TCM health service websites. The main research results are as follows:
(1) Design of a TCM topical crawler (Chapter 3). Web TCM information resources grow quickly, are widely distributed, and change easily. Analyzing and evaluating them presupposes the ability to acquire information cheaply, quickly, and with high quality, so an automated means of Web information acquisition should be used, namely a web crawler that automatically crawls TCM Web information. Unlike the crawler of a general-purpose search engine, this crawler targets only websites whose topic is TCM, which avoids wasting crawl time and improves the accuracy of the crawl targets. Based on these requirements, the dissertation sets the goals of a distributed, scalable, high-performance, high-quality TCM topical crawler, formulates the corresponding crawling strategy, and develops the crawler.
(2) Construction of a Hadoop platform for TCM information resources (Chapters 3 and 6). The crawled TCM-related Web pages cover a wide range and must be updated regularly. Page analysis and data mining on a single machine place very high demands on that machine's performance, so storage in a stand-alone relational database cannot meet high-performance computing requirements. After the crawler fetches pages, they are therefore stored in Hadoop's HDFS, so that later text mining and statistical analysis of the page content can maintain high performance and low system overhead.
(3) Construction of the evaluation index system for TCM Web information resources (Chapters 4 and 5). Starting from the characteristics of TCM Web information resources, the dissertation discusses principles for evaluating them and constructs an evaluation index system. The system has four major parts: information content evaluation, website design evaluation, usability evaluation, and other evaluation. Each part is subdivided into specific second-level indicators, 24 in total, and the meaning and role of each of the 24 indicators is explained in detail. The dissertation then analyzes AHP-based evaluation of TCM information resources, builds judgment matrices, determines the weights of the individual indicators, and performs consistency checks. Comparing the weights establishes the relative importance of each indicator in evaluating TCM Web information resources.
(4) Implementation of data-analysis-based evaluation of TCM Web information resources (Chapter 6). Taking the evaluation of specific TCM websites as an example, the dissertation starts with setting up the analysis environment and describes in detail the hardware and software configuration requirements, the system architecture, and the Hadoop cluster setup. It also explains the design and implementation of the relevant MapReduce algorithms, describes the concrete process of classifying and scoring the websites, and points out the improvements the websites should make on the basis of the evaluation.
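
The AHP step mentioned in the abstract (build a judgment matrix, derive indicator weights, run a consistency check) can be illustrated with a minimal sketch. The pairwise judgments below are hypothetical values chosen only for illustration; the abstract does not give the dissertation's actual matrices or weights. The function name `ahp_weights` and the use of power iteration are likewise assumptions of this sketch, not the author's implementation. The random index values are the commonly cited Saaty table.

```python
# Minimal AHP sketch: weights from a pairwise judgment matrix plus a
# consistency check. All judgment values are HYPOTHETICAL; the abstract
# does not publish the dissertation's real matrices.

# The four top-level parts of the index system named in the abstract.
CRITERIA = ["information content", "website design", "usability", "other"]

# Saaty's random consistency index (RI) for matrix orders 1..9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix, iterations=100):
    """Return (weights, lambda_max, consistency_ratio) for a reciprocal matrix."""
    n = len(matrix)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        # Power iteration: multiply by the matrix, then renormalise to sum to 1.
        product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
        total = sum(product)
        weights = [value / total for value in product]
    # Estimate the principal eigenvalue as the mean of (A w)_i / w_i.
    product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(product[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1) if n > 1 else 0.0
    random_index = RANDOM_INDEX[n]
    consistency_ratio = consistency_index / random_index if random_index else 0.0
    return weights, lambda_max, consistency_ratio


# Hypothetical judgments: entry [i][j] says how much more important
# criterion i is than criterion j on Saaty's 1-9 scale.
judgments = [
    [1,     3,     2,     5],
    [1 / 3, 1,     1 / 2, 3],
    [1 / 2, 2,     1,     3],
    [1 / 5, 1 / 3, 1 / 3, 1],
]

weights, lambda_max, cr = ahp_weights(judgments)
for name, weight in zip(CRITERIA, weights):
    print(f"{name}: {weight:.3f}")
print(f"lambda_max = {lambda_max:.3f}, CR = {cr:.3f}")
```

By the usual convention, a consistency ratio below 0.1 means the judgments are acceptably consistent; otherwise the pairwise comparisons would be revised before the weights are used.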

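[Degree-granting institution]: Shandong University of Traditional Chinese Medicine
[Degree level]: Doctoral
[Year degree awarded]: 2016
[Classification number]: TP393.09; R2-03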

[Similar literature]