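Research on Internet Video Search Engine Technology Based on Text Analysis
Published: 2018-05-11 01:39
Topic: video search engine + Chinese word segmentation; Source: Hangzhou Dianzi University, 2013 master's thesis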
[Abstract]: With the rapid development of network technology, information on the Internet is not only growing geometrically in volume but also diversifying in form. Multimedia content is gradually replacing traditional text as the first choice for people seeking information online. Traditional search engines focus on text search, and their support for searching multimedia such as video and images falls far short of user needs. To address this, the thesis designs a search engine dedicated to Internet video. It retrieves videos fairly accurately by analyzing and mining the text associated with each video, such as titles and comments, and it provides personalized search by analyzing user logs.
The thesis first describes the implementation principle and operation of the web crawler. The crawler collects video-related text from video websites and saves it locally. Because it collects quickly and covers a wide range of sites, it meets users' expectations of fast lookup and broad search coverage reasonably well. Next, the thesis describes how to obtain information about video content indirectly, by analyzing and mining the existing text associated with a video rather than analyzing the video itself. It surveys the mainstream Chinese word segmentation algorithms, compares their strengths and weaknesses, and implements the forward maximum matching algorithm in detail, which provides good segmentation results for the later sentence similarity matching algorithm. It then describes how the crawled video comments are filtered to remove comments that are irrelevant to analyzing video content, such as emotional comments and spam. Relative word frequency is computed over the remaining text to determine what a video is about.
The thesis then describes in detail how to infer a user's query intent from user logs. It first presents the user log mining process and how the logs are processed, using the Sogou user log as an example to obtain data suitable for the subsequent analysis. It proposes a method for inferring query intent based on sentence similarity computation: using the user logs, the method determines which category of video the query terms are most strongly related to, and takes that category as the user's query intent.
Finally, experiments verify the crawling performance of the web crawler and the correctness and feasibility of spam comment filtering and the sentence similarity matching algorithm. These components are combined into a personalized search engine system for Internet video.
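The forward maximum matching segmentation named in the abstract can be illustrated with a minimal sketch. This is the textbook form of the algorithm, not the thesis's own implementation, which the abstract does not show. The function name `forward_max_match`, the `max_word_len` parameter, and the toy dictionary are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_word_len=None):
    """Segment text left to right, taking the longest dictionary word at each position.

    `dictionary` is any container of known words (a set is fastest).
    Characters that start no dictionary word are emitted as single-character tokens.
    """
    if max_word_len is None:
        # Assumption: the longest dictionary entry bounds the match window.
        max_word_len = max((len(word) for word in dictionary), default=1)

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real system would load a large word list.
toy_dictionary = {"视频", "搜索", "搜索引擎", "引擎", "技术", "研究"}
print(forward_max_match("视频搜索引擎技术研究", toy_dictionary))
# → ['视频', '搜索引擎', '技术', '研究']
```

Because the longest candidate is tried first, "搜索引擎" is kept as one token even though "搜索" and "引擎" are also dictionary words. The same greedy choice is why the method can mis-segment ambiguous strings, one of the trade-offs the thesis says it compares across segmentation algorithms.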
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3
Article ID: 1871860
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1871860.html