Research and Implementation of a Semantic Description Algorithm for Agricultural Video

Published: 2018-06-12 22:46

Topic: video retrieval + Chinese semantic description; Source: Northwest A&F University, 2017 master's thesis

【Abstract】: Semantic indexes for agricultural video are incomplete. To address this, this thesis studies and implements a semantic description algorithm for agricultural video. The algorithm generates natural-language sentences that describe a video's semantics, and these sentences serve as the video's semantic index and content summary. This enables retrieval of agricultural videos by semantic keyword and manual screening of the retrieval results, greatly reducing the time agricultural practitioners spend searching for videos about specific farming activities and helping to advance agricultural informatization.

Semantic description of agricultural video faces many difficulties: how to extract semantic key frames that represent a video's semantics, how to recognize the objects in a semantic key frame and their relative relationships, and how to express the recognition results in natural-language sentences. It is an interdisciplinary problem spanning computer vision and natural language processing.

The approach taken in this thesis is as follows: segment the agricultural video into shots at visual transitions and extract semantic key frames for each shot; extract image features from the semantic key frames and map them into a meaning space; extract text features from the semantic descriptions that were manually added to the key frames and map them into the same meaning space; and, in that space, use a recurrent neural network to learn to generate semantic descriptions from semantic key frames, so that a description can be generated for any semantic key frame.

The main work of the thesis is as follows:

(1) Image feature extraction for semantic key frames. Compressed key frames are extracted from the agricultural video. In the compressed domain, a fixed-threshold shot boundary detection algorithm based on histogram features segments the video into shots, and K-Means clustering extracts semantic key frames from each shot. A deep image feature extractor is trained on object location information manually annotated on the semantic key frames and is then used to extract deep image features from them.

(2) Text feature extraction for semantic descriptions. Semantic descriptions are manually added to the semantic key frames of the agricultural videos. A word segmentation algorithm segments the descriptions, and all words in the segmentation results are collected to build an initial Chinese vocabulary. A Chinese word similarity algorithm then merges the synonyms in the initial vocabulary to produce the final Chinese vocabulary. The sequence of indices of a description's words in the final vocabulary serves as that description's text feature.

(3) Learning to generate semantic descriptions from semantic key frames. The image feature of a semantic key frame is mapped to a single meaning vector in the meaning space and encoded into the hidden layer of the recurrent neural network. The text feature of the corresponding semantic description is mapped to a set of meaning vectors in the meaning space, which serves as the decoding input to the hidden layer. The network's encoding and decoding matrices are learned from the semantic key frames and semantic descriptions in the training dataset.

The main innovations of the thesis are extracting image features from semantic key frames by region rather than from the whole image, and extracting text features from semantic descriptions by synonym rather than by individual word. Experiments on the Nongshi Zhitongche (农事直通车) dataset show that these two innovations raise the semantic description score for agricultural video by 5.1 and 1.7, respectively.
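
The shot segmentation step described in the abstract (a fixed threshold applied to histogram differences between consecutive frames) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes decoded grayscale frames given as flat lists of pixel values in 0..255, whereas the thesis works in the compressed domain, and the function names, the 16-bin histogram, and the threshold of 0.5 are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed representation: one frame = flat sequence of grayscale pixels (0..255).
Frame = Sequence[int]


def gray_histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Return a normalized intensity histogram with `bins` equal-width bins."""
    if len(frame) == 0:
        raise ValueError("frame must contain at least one pixel")
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    return [c / len(frame) for c in counts]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Half the L1 distance between two normalized histograms, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_boundaries(
    frames: Sequence[Frame], threshold: float = 0.5, bins: int = 16
) -> List[int]:
    """Return every index i at which frame i starts a new shot.

    A boundary is declared when the histogram distance between frame i-1
    and frame i exceeds the fixed `threshold` (an illustrative value).
    """
    hists = [gray_histogram(f, bins) for f in frames]
    return [
        i
        for i in range(1, len(hists))
        if histogram_distance(hists[i - 1], hists[i]) > threshold
    ]


def split_into_shots(
    frames: Sequence[Frame], threshold: float = 0.5, bins: int = 16
) -> List[Tuple[int, int]]:
    """Return shots as (start, end) frame-index pairs, with end exclusive."""
    if len(frames) == 0:
        return []
    cuts = detect_shot_boundaries(frames, threshold, bins)
    return list(zip([0] + cuts, cuts + [len(frames)]))


if __name__ == "__main__":
    # Three dark frames followed by two bright frames: one cut at index 3.
    dark, bright = [10] * 64, [200] * 64
    print(split_into_shots([dark, dark, dark, bright, bright]))  # [(0, 3), (3, 5)]
```

A fixed threshold is simple and cheap, but it is sensitive to the choice of value: gradual transitions such as fades can stay below it, and abrupt lighting changes within one shot can exceed it. In the pipeline the abstract describes, each resulting shot would then be passed to K-Means clustering to select its semantic key frames.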
【Degree-granting institution】: Northwest A&F University
【Degree level】: Master's
【Year degree awarded】: 2017
【Classification codes】: H030; S126; TP391.41

【References】