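Development of a Recombinant BCG Vaccine Carrying the Toxoplasma gondii Rhomboid Gene
Published: 2018-05-08 09:15
Topic: vertical search engine + Lucene; Source: Jilin University, master's thesis, 2012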
[Abstract]: With the rapid growth of the Internet, the emergence of search engines was inevitable. The Internet resembles an enormous library that holds, and continuously produces, vast amounts of information. This mass of information is far beyond what we can imagine or manage; without search engines, we might be unable to find the information we are looking for at all.
Web data scraping is a software technique for extracting information from websites quickly and in bulk. A scraper simulates the behavior of a browser and can extract any data that a browser can display. Its ultimate goal is to pull unstructured information out of large numbers of web pages and store it in structured form.
Traditional search technology has shortcomings that make it hard to meet users' needs. First, it places high demands on keyword selection, and poorly chosen keywords hamper inexperienced users. Second, the results such an engine can show on a results page are limited and uniform, and are often full of redundant information. The cause is that the technique is a simple query over one-dimensional keywords: the search engine does not actively "understand" documents but only passively matches keywords. As a result, users often fail to obtain valuable information. This is especially evident in the job-search domain, where information is both time-sensitive and highly structured.
Information redundancy on the Internet is enormous: a single article may be reposted hundreds or thousands of times. Current technology offers some means of recognizing such duplicates, but they remain relatively weak.
Put simply, a vertical search engine is a specialized search engine for a particular industry, as opposed to a general-purpose search engine. It refines, integrates, and classifies the information in a specialized page repository and extracts specific data to return to the user. What it collects is structured data and metadata, which is its biggest difference from general search. It usually consists of three major parts: a crawling system, an indexing system, and a search system.
This thesis gives a theoretical analysis of the development of vertical search engines and the problems they face along the way, introduces the key technologies of vertical search systems, and describes their classification and related background. It designs the operating rules of the web spider, proposes a framework for a vertical search engine system for educational information, analyzes the role of each functional module, presents the system architecture, and constructs the system's processing flow. It studies in detail the implementation of the information crawling, Chinese extraction, and retrieval functions within that framework. The management module, page crawling, data processing, and index building are designed so as to construct a vertical search framework for information in the education domain. The processing steps are described for the overall structure, the front end, and the back end, and a UML use-case analysis is given at the end.
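The three-part structure named in the abstract (crawling, indexing, search) can be illustrated with a small sketch. This is not the thesis's implementation, and it does not use Lucene: it is a plain-Python stand-in under several assumptions. The pages live in a hypothetical in-memory dictionary `PAGES` instead of being fetched by a spider, the names `extract_record`, `tokenize`, and `InvertedIndex` are invented for illustration, and the tokenizer handles ASCII words only, so a real system for Chinese text would need a proper word segmenter.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collects the <title> text and the remaining text nodes of one page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def extract_record(url, html):
    """Extraction step: turn unstructured HTML into a structured record."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    return {"url": url,
            "title": parser.title.strip(),
            "body": " ".join(parser.text_parts)}


def tokenize(text):
    # ASCII-only tokenizer (assumption); Chinese text needs word segmentation.
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Indexing and search steps: map each term to the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = []

    def add(self, record):
        doc_id = len(self.docs)
        self.docs.append(record)
        for term in tokenize(record["title"] + " " + record["body"]):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Pure keyword matching: a document must contain every query term.
        terms = tokenize(query)
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.docs[i] for i in sorted(ids)]


# Hypothetical pages standing in for what a spider would download.
PAGES = {
    "http://example.edu/a": "<html><head><title>Graduate admissions</title></head>"
                            "<body><p>Application deadline for the master program.</p></body></html>",
    "http://example.edu/b": "<html><head><title>Course catalog</title></head>"
                            "<body><p>Undergraduate and master courses.</p></body></html>",
}

index = InvertedIndex()
for url, html in PAGES.items():
    index.add(extract_record(url, html))

hits = index.search("master program")
print([h["url"] for h in hits])  # prints ['http://example.edu/a']
```

The search step here is exactly the passive keyword matching that the abstract criticizes. The sketch only shows where the structured record produced by the extraction step enters the index, which is the point at which a vertical engine can add domain-specific fields.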
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.3
Article ID: 1860819
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1860819.html