Research on Acoustic Scene Classification Methods Based on Ensembles of Multiple Deep Models
Published: 2018-03-14 12:15
Topic: acoustic scene classification; Keyword: deep learning; Source: Harbin Institute of Technology, 2017 master's thesis; Type: degree thesis
[Abstract]: Acoustic Scene Classification (ASC) is a specific task within Computational Auditory Scene Analysis (CASA): given the acoustic content of an audio stream, it assigns the semantic label of the corresponding scene, thereby enabling perception and understanding of the surrounding environment. Unlike psychological studies that seek to understand how humans perceive audio scenes, ASC relies mainly on signal processing and machine learning to recognize scenes automatically. Traditional ASC work has focused on feature extraction and classifier selection for individual scenes. With the rapid development of audio recording devices, large and diverse audio collections are now gathered, and traditional signal processing and recognition methods face serious challenges; new techniques are urgently needed. To make full use of this wealth of audio scene data, this thesis explores several deep learning models, including the Multi-Layer Perceptron (MLP), the Convolutional Neural Network (CNN), and the Long Short-Term Memory (LSTM) network. First, frame-level features are extracted from the audio, namely Mel-Frequency Cepstral Coefficients (MFCC) and log-mel spectrogram features; the frames are then concatenated into segment-level features and fed into the deep models for classification. To improve the LSTM-based ASC system, this thesis proposes a segment-processing technique based on shuffled bootstrap sampling. This technique not only simulates complex temporal orderings but also enlarges the training set, strengthening the model's generalization ability. To improve the MLP-based ASC method, an attention mechanism is introduced into the model structure. Attention moves beyond a single global representation of the data and focuses on its most informative parts; it also handles the decoupling problem well, in that different feature subspaces are used to describe different scenes. Different kinds of deep models excel at different scenes: the MLP recognizes beaches and residential areas well, while the CNN distinguishes libraries and buses more easily. Ensemble learning, which combines multiple learners, often achieves markedly better generalization than any single learner. Therefore, to combine the classifiers' complementary strengths across scenes, this thesis applies a variety of ensemble fusion methods; in particular, an ensemble selection method based on the BAGGING (Bootstrap AGGregatING) framework clearly improves the classification performance of the ASC task.
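The shuffled bootstrap segment processing described for the LSTM system can be illustrated with a minimal numpy sketch. This is our own reading of the abstract, not the thesis's actual implementation: the function name and parameters are hypothetical, and the thesis may constrain the sampling differently.

```python
import numpy as np

def shuffled_bootstrap_segments(frames, seg_len, n_segments, rng=None):
    """Sketch of 'shuffled bootstrap' segment sampling (hypothetical API).

    Draws frame indices with replacement (bootstrap) and shuffles their
    order, so the resulting segments both enlarge the training set and
    expose the model to varied temporal orderings.

    frames     : (n_frames, n_features) frame-level features
                 (e.g. MFCC or log-mel vectors)
    seg_len    : number of frames per output segment
    n_segments : number of segments to draw
    returns    : (n_segments, seg_len, n_features) array
    """
    rng = np.random.default_rng(rng)
    n_frames, n_features = frames.shape
    segments = np.empty((n_segments, seg_len, n_features), dtype=frames.dtype)
    for i in range(n_segments):
        # Bootstrap: sample frame indices with replacement ...
        idx = rng.integers(0, n_frames, size=seg_len)
        # ... then shuffle, deliberately breaking the original frame order.
        rng.shuffle(idx)
        segments[i] = frames[idx]
    return segments
```

Each returned segment would then be fed to the sequence model in place of (or alongside) the original contiguous segments, multiplying the effective amount of training data.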
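The ensemble fusion step — combining classifiers that excel at different scenes — can be sketched as weighted averaging of per-model class posteriors. This is a minimal illustration of score-level fusion under our own assumed interfaces; the thesis's BAGGING-based ensemble selection additionally chooses which members to keep.

```python
import numpy as np

def fuse_posteriors(posteriors, weights=None):
    """Fuse per-model class posteriors by (weighted) averaging.

    posteriors : (n_models, n_samples, n_classes) array of class
                 probabilities from each base classifier
    weights    : optional (n_models,) non-negative model weights;
                 defaults to uniform (plain averaging)
    returns    : (n_samples,) predicted class indices after fusion
    """
    posteriors = np.asarray(posteriors, dtype=float)
    if weights is None:
        weights = np.ones(posteriors.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Weighted sum over the model axis -> (n_samples, n_classes).
    fused = np.tensordot(weights, posteriors, axes=1)
    return fused.argmax(axis=1)
```

Setting a model's weight to zero removes it from the ensemble, which is the basic operation an ensemble selection procedure optimizes over.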
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year awarded]: 2017
[Classification codes]: TP18; TN912.3
Article ID: 1611162
Link: https://www.wllwen.com/kejilunwen/xinxigongchenglunwen/1611162.html