Acquisition and Simple Application of Time Information in Real-Time Search Engines

Published: 2018-03-22 12:16

Topic: real-time search engines　Focus: page time information　Source: Jilin University, master's thesis, 2012　Type: degree thesis

[Abstract]: In recent years, social networking services (SNS) and microblogs have emerged from modern Internet technology and grown extremely quickly in a very short time. These online communities have attracted huge numbers of users, who can post information freely, anytime and anywhere. At the same time, traditional newspapers are gradually turning into electronic news media, and news reports about events appear on the Internet ever more promptly.
How can users obtain these two types of information promptly, quickly, and accurately? The most common way to obtain information on the Web is to query a search engine: the user enters keywords describing the information needed, the search engine looks up relevant pages in its index database, ranks the results according to certain rules, and returns the top-K to the user. Can users then obtain SNS posts, microblog posts, and the latest news reports through a traditional search engine? Because such information is generated in real time, at every moment, a traditional search engine cannot index it and add it to the index database promptly when it appears, so it cannot meet users' need to search real-time information. This need has driven the rapid development of real-time search engines, which have grown quickly in recent years and provide real-time retrieval services for SNS, microblog, and news information.
In real-time search, obtaining a page's time information is the key to providing real-time retrieval services. This time information mainly comprises the page's creation time, its update time, and the update cycle of its content. To obtain it, the page must first be processed to remove the parts unrelated to its main content; the time information is then derived from the main content or from the link relationships within the page.
While analyzing these kinds of real-time pages, this thesis found that the main content of such a page generally consists of a single independent content block, and that, after natural language processing, the main content shows highly regular part-of-speech features. Building on the DOM tree model proposed by the W3C, and using these part-of-speech features together with the visual information carried by the HTML tags themselves, the thesis proposes SemV, an algorithm that extracts a page's main content and reconstructs the page. Experiments show that SemV identifies the main content of news pages effectively and accurately, and that it also reduces the storage space needed to save pages, saving hardware resources.
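The single-content-block observation can be illustrated with a deliberately simplified sketch. It is not SemV: it uses no part-of-speech features and no visual information, only the assumption that the main content is the innermost block element holding the most non-link text. The names `MainBlockFinder`, `extract_main_content`, `BLOCK_TAGS`, and `SKIP_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch only.
BLOCK_TAGS = {"div", "article", "section", "td"}
SKIP_TAGS = {"script", "style", "a"}  # link text is treated as non-content


class MainBlockFinder(HTMLParser):
    """Collects non-link text per block element (innermost block wins)."""

    def __init__(self):
        super().__init__()
        self.block_stack = []  # indices of currently open block elements
        self.block_text = []   # collected text pieces, one list per block
        self.skip_depth = 0    # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.block_text.append([])
            self.block_stack.append(len(self.block_text) - 1)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS and self.block_stack:
            self.block_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth and self.block_stack:
            # Attribute the text to the innermost open block.
            self.block_text[self.block_stack[-1]].append(text)


def extract_main_content(html):
    """Return the text of the block with the most non-link characters."""
    finder = MainBlockFinder()
    finder.feed(html)
    finder.close()
    blocks = [" ".join(parts) for parts in finder.block_text]
    return max(blocks, key=len, default="")
```

Navigation bars consist mostly of link text, so they score close to zero under this heuristic. A real implementation would also need to handle malformed markup, which this sketch ignores.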
On the basis of extracting the main content and reconstructing the page, the words and phrases that express time information are extracted from the page content according to their distribution and the patterns in which time information appears, and the page's time information is then estimated from them. In a semantic analysis of the relationships among news reports, the thesis found that reports about the same event are strongly connected and are all centered on that event. Based on an analysis of these connections, the thesis proposes a news event object model and EOM, a method based on this model for estimating a page's time information. Experiments verify the feasibility and accuracy of the model; the model and method work well for news reports and for microblog and social network posts related to news reports.
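Pattern-based extraction of time expressions can be illustrated with a small sketch. The two date patterns, the function names `extract_time_expressions` and `estimate_creation_time`, and the rule of taking the first date in the text as the creation time are all assumptions made for illustration; the abstract does not describe the thesis's actual patterns or estimation rule, and this sketch does not implement EOM.

```python
import re
from datetime import datetime

# Hypothetical patterns; the thesis's real pattern set is not given here.
DATE_PATTERNS = [
    # 2012-03-22 or 2012/03/22, optionally followed by HH:MM
    re.compile(r"(\d{4})[-/](\d{1,2})[-/](\d{1,2})(?:\s+(\d{1,2}):(\d{2}))?"),
    # 2012年3月22日, optionally followed by HH:MM
    re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日(?:\s*(\d{1,2}):(\d{2}))?"),
]


def extract_time_expressions(text):
    """Return (position, datetime) pairs for each date found, by position."""
    found = []
    for pattern in DATE_PATTERNS:
        for m in pattern.finditer(text):
            year, month, day = int(m.group(1)), int(m.group(2)), int(m.group(3))
            hour = int(m.group(4)) if m.group(4) else 0
            minute = int(m.group(5)) if m.group(5) else 0
            try:
                found.append((m.start(), datetime(year, month, day, hour, minute)))
            except ValueError:
                continue  # skip impossible dates such as 2012-13-40
    return sorted(found)


def estimate_creation_time(text):
    """Assumed rule: the first date appearing in the text is the creation time."""
    expressions = extract_time_expressions(text)
    return expressions[0][1] if expressions else None
```

The first-date rule reflects the common layout in which a news page prints its publication time near the top of the main content; it would misfire on pages that open by mentioning an older event.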
After obtaining the time information, the thesis compares two methods by which a real-time search engine's crawler recrawls pages to obtain updated content: one based on natural order and one based on page importance. Finally, combining a page's update time, update cycle, and importance, it proposes a greedy method in which the page's update time and update cycle guide the crawler's recrawling. The greedy strategy is that the crawler crawls the pages with the shortest update cycle first. This scheduling strategy allocates the crawler's limited hardware and network bandwidth efficiently, obtains updated page content promptly, improves the crawler's efficiency, and reduces the load on both the crawler and the servers.
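The shortest-cycle-first rule can be sketched as follows. The `Page` record, the function `pick_pages_to_crawl`, the `budget` parameter, and the definition of a page being due once one full update cycle has elapsed are assumptions for illustration. Page importance, which the thesis also takes into account, is left out because the abstract does not say how it is weighted.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    last_update: float   # time of the last known update, in seconds
    update_cycle: float  # estimated interval between updates, in seconds


def pick_pages_to_crawl(pages, now, budget):
    """Greedily choose up to `budget` due pages, shortest update cycle first."""
    # Assumption: a page is due once a full update cycle has elapsed.
    due = [p for p in pages if p.last_update + p.update_cycle <= now]
    # Keep the `budget` pages with the shortest cycles; the URL breaks ties
    # so that the result is deterministic.
    return heapq.nsmallest(budget, due, key=lambda p: (p.update_cycle, p.url))
```

Here `budget` stands in for the crawler's limited hardware and bandwidth: pages that are not yet due are never fetched, and among due pages the fastest-changing ones are fetched first.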
Finally, because real-time search engines have only developed recently, many problems remain to be solved. The thesis lists some problems that need further study and points out directions for future research.

[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3

[Similar literature]