Research information
My research interests focus on audio and speech coding and many related topics, from fundamental theory to implementation issues. In particular, I am looking at perceptual audio coding based on auditory model inversion, phase perception, and phase information coding and reconstruction. Some of my older work on perceptually motivated noise reduction can also be found here.
Most of my research to date has been done with the MMSP (Multimedia Signal Processing) group of the Telecommunication and Signal Processing Lab at McGill University. More recently, I did a Post-Doc at INRIA/IRISA Rennes, working on source separation. Presently I am working as a Post-Doc in the Acoustics group at the Carl von Ossietzky Universität Oldenburg.
2017
"Pitch features for low-complexity online speaker tracking"
In a complex acoustic scene with many speakers, a hearing-aid user would benefit if an intelligent hearing aid could distinguish a target speaker from interfering speakers, such that the former can be enhanced and the latter can be suppressed. Thus, an intelligent hearing aid needs to be able to track speakers, to recognize that some target speaker identified previously is speaking again. This task is difficult in hearing aids since it is impossible to have specific speaker models and the acoustic conditions are likely to change, e.g., due to head movements or changes in the location of the speaker. Previous research has shown that low-complexity speaker tracking can be obtained using spectral features, requiring speech fragments of at least 3 seconds to achieve reasonable tracking performance. In this research, we examine the additional use of pitch features in order to further reduce the duration of the required speech fragment, so that the hearing aid can be steered more rapidly in real-world scenarios.
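As an illustration of the kind of pitch feature involved, the sketch below estimates the fundamental frequency of a short speech frame with the classic autocorrelation method. This is a generic textbook approach, not the feature extraction used in the paper; all parameter values are illustrative.

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate the pitch (F0) of one frame via the autocorrelation method.
    Returns the F0 estimate in Hz and the normalized autocorrelation peak,
    which can serve as a crude voicing-strength indicator."""
    frame = frame - np.mean(frame)
    n = len(frame)
    # Autocorrelation via FFT, zero-padded to avoid circular wrap-around
    spec = np.fft.rfft(frame, 2 * n)
    ac = np.fft.irfft(spec * np.conj(spec))[:n]
    ac /= ac[0] + 1e-12                      # normalize so lag 0 == 1
    lo, hi = int(fs / fmax), int(fs / fmin)  # plausible pitch-period lags
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag, ac[lag]

# Example: a 200 Hz harmonic complex (3 harmonics) sampled at 16 kHz
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
x = sum(np.sin(2 * np.pi * 200 * k * t) for k in (1, 2, 3))
f0, strength = autocorr_pitch(x, fs)   # f0 close to 200 Hz
```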
Proc. DAGA 2017, Kiel, Germany
Link to paper.
"A distance measure to combine monaural and binaural auditory cues for sound source segregation"
In computer-based analysis of audio signals, the blind segregation of sound sources from a mixture of sounds remains a challenge. A common approach to solve this task is to divide the signal under consideration into small time-frequency units and extract the prominent features within each of these segments. Based on this information, the units are assigned to a sound source which is assumed to be dominating the regarded region. Past studies have shown that localization cues such as interaural phase and level differences can be used to enable the segregation. Since a human listener probably exploits a variety of cues to achieve the best performance in separating unknown sound sources, the algorithm presented here aims at integrating more than one cue type to improve the segregation. The algorithm is based on both pitch and localization cues which are combined using a distance measure. Applying machine-learning techniques, a distance measure is derived to reflect the probability of two adjacent time-frequency units being dominated by the same source. This distance measure is thereafter used to cluster segments and assign these clusters to a certain speaker by accumulating the information of source location across all time-frequency units within one cluster.
Proc. DAGA 2017, Kiel, Germany
Link to paper.
2016
"Customized high performance low power processor for binaural speaker localization"
Proc. ICECS 2016, Monte Carlo, Monaco
For hearing-impaired persons, one of the key problems is the cocktail party scenario, in which a bilateral conversation is surrounded by other speakers and noise sources. State-of-the-art beamforming techniques are able to segregate specific sound sources from the environment, provided the position of the speaker is known. The speaker position can be estimated in the frontal azimuth plane with a probabilistic localization algorithm from the binaural microphone input of the both-eared hearing aid system. However, binaural speaker localization requires computationally complex audio processing and filtering. The high computational complexity, combined with the low energy budget imposed by the battery constraints of hearing aid devices, presents an implementation challenge. This paper proposes a customized C-programmable processor design that implements the speaker localization algorithm while fulfilling the challenging requirements placed by the usage context. When compared to a VLIW-based processor design with similar basic computational resources and no special instructions, the proposed processor reaches a 151x speed-up. For a 28 nm standard CMOS technology, a power consumption of 12 mW (at 50 MHz) and a silicon area of 0.3 mm² are estimated. This is the first publication of a realistic programmable processing architecture for probabilistic binaural speaker localization or a comparably complex algorithm for hearing aid devices; the algorithms supported by previously proposed implementations are approximately 15x less computationally demanding.
Paper on IEEE Xplore.
"Speaker Tracking for Hearing Aids"
Proc. MLSP 2016, Vietri sul Mare, Salerno, Italy
Modern multi-microphone hearing aids employ spatial filtering algorithms capable of enhancing speakers from one direction whilst suppressing interfering speakers of other directions. In this context, it is useful to track moving speakers in the acoustic space by linking disjoint speech segments. Since the identity of the speakers is not known beforehand, the system must match short speech segments without having a specific speaker model or prior knowledge of the speech content, while ignoring changes in acoustic conditions. In this paper, we present a method that matches each speech segment to non-specific speaker models thereby obtaining an activation pattern, and then compares the patterns of disjoint speech segments to each other. The proposed method is low in computational complexity and memory footprint and uses mel-frequency cepstral coefficients (MFCCs) and Gaussian mixture models (GMMs). We find that, when using MFCCs as acoustic features, the proposed speaker tracking method is robust to changes in the acoustic environment provided that sufficiently large segments of speech are available.
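The activation-pattern idea can be sketched as follows: score each frame of a segment against a bank of generic (non-specific) speaker models, average the posteriors into a per-segment pattern, and compare patterns of disjoint segments. This is a simplified illustration using single diagonal Gaussians as stand-ins for trained GMMs; all model parameters and data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss_diag(X, mu, var):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)

def activation_pattern(X, models):
    """Segment-level activation pattern: mean posterior over the models.
    X: (frames, dims) features (e.g. MFCCs); models: list of (mu, var)."""
    ll = np.stack([log_gauss_diag(X, mu, var) for mu, var in models], axis=1)
    ll -= ll.max(axis=1, keepdims=True)        # numerical stability
    post = np.exp(ll)
    post /= post.sum(axis=1, keepdims=True)    # per-frame posteriors
    return post.mean(axis=0)

def pattern_similarity(p, q):
    """Cosine similarity between two activation patterns."""
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

# Toy demo: two segments from "speaker A" should match each other better
# than a segment from "speaker B"
models = [(rng.normal(size=4), np.ones(4)) for _ in range(8)]
seg_a1 = rng.normal(loc=2.0, size=(50, 4))
seg_a2 = rng.normal(loc=2.0, size=(50, 4))
seg_b  = rng.normal(loc=-2.0, size=(50, 4))
sim_same = pattern_similarity(activation_pattern(seg_a1, models),
                              activation_pattern(seg_a2, models))
sim_diff = pattern_similarity(activation_pattern(seg_a1, models),
                              activation_pattern(seg_b, models))
```

The appeal of this scheme for hearing aids is that only the fixed model bank and short pattern vectors need to be stored, keeping memory and computation low.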
Link to paper, and the poster.
"Probabilistic 2D localization of sound sources using a multichannel bilateral hearing aid"
Proc. DAGA 2016, Aachen, Germany
In the context of localization for Computational Auditory Scene Analysis (CASA), probabilistic localization is a technique where a probability that a sound source is present is computed for each possible direction. This approach has been shown to work well with binaural signals provided the location of the sources to be localized is in front of the user and approximately on the same plane as the ears. Modern hearing aids use multiple microphones to perform array processing, and in a bilateral configuration, the extra microphones can be used by localization algorithms to not only estimate the horizontal direction (azimuth), but the vertical direction (elevation) as well, thereby also resolving the front-back confusion. In this work, we present three different approaches to use Gaussian Mixture Model classifiers to localize sounds relative to a multi-microphone bilateral hearing aid. One approach is to divide a unit sphere into a nonuniform grid and assign a class to each grid point; the other two approaches estimate elevation and azimuth separately, using either a vertical-polar coordinate system or an ear-polar coordinate system. The benefits and drawbacks in terms of performance, computational complexity and memory requirements are discussed for each of these approaches.
Link to paper.
"Speech enhancement for multimicrophone binaural hearing aids aiming to preserve the spatial auditory scene"
EURASIP Journal on Advances in Signal Processing 2016, 2016:12
Modern binaural hearing aids utilize multimicrophone speech enhancement algorithms to enhance signals in terms of signal-to-noise ratio, but they may distort the interaural cues that allow the user to localize sources, in particular, suppressed interfering sources or background noise. In this paper, we present a novel algorithm that enhances the target signal while aiming to maintain the correct spatial rendering of both the target signal as well as the background noise. We use a bimodal approach, where a signal-to-noise ratio (SNR) estimator controls a binary decision mask, switching between the output signals of a binaural minimum variance distortionless response (MVDR) beamformer and scaled reference microphone signals. We show that the proposed selective binaural beamformer (SBB) can enhance the target signal while maintaining the overall spatial rendering of the acoustic scene.
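The core switching idea can be sketched per time-frequency bin: where the SNR estimate says the target dominates, take the beamformer output; elsewhere, take the scaled reference-microphone signal (which keeps its binaural cues). This is a minimal sketch of the selection step only; the SNR estimation, beamforming, and the actual parameter values from the paper are omitted, and the names below are illustrative.

```python
import numpy as np

def sbb_output(beamformer_tf, reference_tf, snr_db,
               threshold_db=0.0, noise_atten_db=10.0):
    """Selective switching between a beamformer output and a scaled
    reference signal, per time-frequency bin. All inputs are complex STFT
    matrices of equal shape; threshold and attenuation are illustrative."""
    mask = snr_db > threshold_db           # True = target-dominated bin
    g = 10 ** (-noise_atten_db / 20)       # linear gain for noise bins
    return np.where(mask, beamformer_tf, g * reference_tf)

# Toy demo on a 2x3 "STFT": high-SNR bins pass the beamformer output,
# low-SNR bins pass the attenuated reference
bf  = np.full((2, 3), 1.0 + 0j)
ref = np.full((2, 3), 2.0 + 0j)
snr = np.array([[5.0, -5.0, 5.0], [-5.0, 5.0, -5.0]])
out = sbb_output(bf, ref, snr)
```

Because noise-dominated bins are passed through (only attenuated) rather than spatially filtered, their interaural cues survive; the price is a hard binary decision, which is why the SNR estimator is central to the method.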
Link to paper (HTML). :: Link to paper (PDF). :: Supplemental Data (Files used for subjective evaluation).
2015
"Features for Speaker Localization in Multichannel Bilateral Hearing Aids"
Proc. European Signal Processing Conf. (EUSIPCO) 2015
Modern hearing aids often contain multiple microphones to enable the use of spatial filtering techniques for signal enhancement. To steer the spatial filtering algorithm it is necessary to localize sources of interest, which can be intelligently achieved using computational auditory scene analysis (CASA). In this article, we describe a CASA system using a binaural auditory processing model that has been extended to six channels to allow reliable localization in both azimuth and elevation, thus also distinguishing between front and back. The features used to estimate the direction are one level difference and five inter-microphone time differences of arrival (TDOA). Initial experiments are presented that show the localization errors that can be expected with this set of features on a typical multichannel hearing aid in anechoic conditions with diffuse noise.
Link to paper.
"Method and apparatus for sound source separation based on a binary activation model"
UK Patent GB 2510650, granted July 2015.
Link to patent.
"Multiple Model High-Spatial Resolution HRTF Measurements"
Proc. DAGA 2015, Nürnberg, Germany
The shape of human heads is not uniform, yet databases containing the head-related transfer functions (HRTFs) of more than just one head and torso simulator (HATS) are rare. Furthermore, in the past the coverage and resolution of the spatial points from which the HRTFs are measured was limited to ease distribution by physical media. In the course of ongoing research at the University of Oldenburg, we decided to create a database of HRTFs for three commercially available HATS as well as one custom-made HATS. Using the two-arc-source-positioning system described by Brinkmann et al. (DAGA 2013), three of these HATS were measured with and without multichannel hearing aid simulators fitted. Our database uses a higher than typical spatial resolution of 2° in azimuth and elevation, with coverage from 64° below the horizon to the zenith (90°). Grid points were omitted only near the zenith. In this work, we describe the details of the measurement setup and the post-measurement analysis methods. From the analysis, we describe the observed variability in the data due to the HATS and the measurement setup.
Link to paper.
"Tracking Tone Complexes in Audio Signals Using Structures across Time and Frequency"
Proc. DAGA 2015, Nürnberg, Germany
Most natural sounds, voices, and musical instruments produce modulated tone complexes. The modulation of these tone complexes is of vital interest for topics like melody extraction, speech recognition, and computational auditory scene analysis. In this work, we introduce a new approach to tracking the modulation of tone complexes. This algorithm, called Stretch-Correlation, tracks the modulation of tone complexes by comparing successive short-time spectra using resampling and spectral correlation. It is compared with two well-known fundamental-frequency estimators, YIN and PEFAC, and is shown to outperform both at positive signal-to-noise ratios for both synthetic tone complexes and real instrument recordings.
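The resampling-and-correlation step can be sketched as follows: resample one magnitude spectrum with a set of candidate stretch factors and pick the factor that best correlates with the next spectrum. This is a generic illustration of the idea, not the published algorithm; the spectra and stretch grid are synthetic.

```python
import numpy as np

def stretch_correlation(spec_prev, spec_cur, stretches):
    """Estimate the relative frequency stretch between two successive
    magnitude spectra: resample spec_prev with each candidate factor and
    return the factor giving maximum correlation with spec_cur."""
    bins = np.arange(len(spec_prev))
    best, best_corr = 1.0, -np.inf
    for s in stretches:
        # A peak at bin k in spec_prev lands at bin k*s after stretching
        stretched = np.interp(bins, bins * s, spec_prev, left=0.0, right=0.0)
        c = np.corrcoef(stretched, spec_cur)[0, 1]
        if c > best_corr:
            best, best_corr = s, c
    return best

# Toy demo: a harmonic comb at 50 bins vs the same comb shifted up by 2 %
n = 512
def comb(f0):
    spec = np.zeros(n)
    for k in range(1, 6):
        spec[int(round(k * f0))] = 1.0
    return np.convolve(spec, np.hanning(9), mode="same")  # soften peaks

stretches = np.linspace(0.95, 1.05, 41)
est = stretch_correlation(comb(50.0), comb(51.0), stretches)  # near 1.02
```

Because all harmonics stretch by the same factor, the correlation peak is much sharper than what a single-partial tracker would see, which is what makes the approach robust for tone complexes.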
Link to paper.
2014
"A Binaural Hearing Aid Speech Enhancement Method Maintaining Spatial Awareness for the User"
Proc. European Signal Processing Conf. (EUSIPCO) 2014, Lisbon, Portugal
Multi-channel hearing aids can use directional algorithms to enhance speech signals based on their spatial location. In the case where a hearing aid user is fitted with a binaural hearing aid, it is important that the binaural cues are kept intact, such that the user does not lose spatial awareness, the ability to localize sounds, or the benefits of spatial unmasking. Typically, algorithms focus on rendering the source of interest in the correct spatial location, but degrade all other source positions in the auditory scene. In this paper, we present an algorithm that uses a binary mask such that the target signal is enhanced but the background noise remains unmodified except for an attenuation. We also present two variations of the algorithm, and in initial evaluations find that this type of mask-based processing has promising performance.
Paper on IEEE Xplore.
"Heating Mechanism for DNA Amplification, Extraction or Sterilization Using Photo-Thermal Nanoparticles"
Canadian Patent Application CA 2821335 (US 2014170664), filed July 2014.
Link to patent.
"Spatial Properties of the DEMAND Noise Recordings"
Proc. DAGA 2014, Oldenburg, Germany
"Diverse Environment Multichannel Audio Noise Database" (DEMAND) is a collection of 18 recordings of environmental noise in a variety of indoor and outdoor environments. The database was recorded using a planar array of 16 microphones, arranged in 4 staggered rows. This diverse collection of noises allows users to test a variety of array signal processing algorithms with a large amount of realistic background noise. In this work, we examine the spatial and temporal properties of the recorded noises. Current online and offline microphone array calibration techniques are applied on the published data and compared to the design specification of the array. We compare the applicability of the calibration algorithms in the various noise environments, and how consistent the results of the algorithms are to each other.
Link to paper.
"Erhaltung der räumlichen Wahrnehmung bei Störgeräuschreduktion in Hörgeräten"
Proc. DAGA 2014, Oldenburg, Germany
For hearing aid users it is important to preserve the acoustic scene as faithfully as possible, in order to maintain the accustomed spatial separation of the sound sources. In traditional hearing systems, however, this spatial information is usually lost in favor of effective noise reduction. We therefore developed a method that preserves the spatial information while still performing noise reduction. First, the time-frequency components of the input signal are classified, deciding which components belong to the target signal (e.g., a speaker) and which belong to the noise. This allows the noise to be attenuated while the target signal remains unchanged. Although this system can introduce artifacts, they do not occur when the parameters are chosen appropriately. The method was compared in an evaluation against similar state-of-the-art methods.
Link to paper.
2013
"An experimental comparison of source separation and beamforming techniques for microphone array signal enhancement"
Int. Workshop on Machine Learning for Sig. Proc. (MLSP) 2013, Southampton, UK
We consider the problem of separating one or more speech signals from a noisy background. Although blind source separation (BSS) and beamforming techniques have both been exploited in this context, the former have typically been applied to small microphone arrays and the latter to larger arrays. In this paper, we provide an experimental comparison of some established beamforming and post-filtering techniques on the one hand and modern BSS techniques involving advanced spectral models on the other hand. We analyze the results as a function of the number of microphones, the number of speakers and the input Signal-to-Noise Ratio (iSNR) w.r.t. multichannel real-world environmental noise recordings. The results of the comparison show that, provided that a suitable post-filter or spectral model is chosen, beamforming performs similar to BSS on average in the single-speaker case while in the two-speaker case BSS exceeds beamformer performance. Crucially, this claim holds independently of the number of microphones.
Paper on IEEE Xplore.
"A fast EM algorithm for Gaussian model-based source separation"
Proc. European Signal Processing Conf. (EUSIPCO) 2013 (Marrakech, Morocco), Sept. 2013.
We consider the FASST framework for audio source separation, which models the sources by full-rank spatial covariance matrices and multilevel nonnegative matrix factorization (NMF) spectra. The computational cost of the expectation-maximization (EM) algorithm in [Ozerov, 2012] greatly increases with the number of channels. We present alternative EM updates using discrete hidden variables which exhibit a smaller cost. We evaluate the results on mixtures of speech and real-world environmental noise taken from our DEMAND database. The proposed algorithm is several orders of magnitude faster and it provides better separation quality for two-channel mixtures in low input signal-to-noise ratio (iSNR) conditions.
Paper on IEEE Xplore.
"The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings"
Int. Conf. Acoustics 2013 (Montreal, Canada), Journal of the Acoustical Society of America, Volume 133, Issue 5, June 2013
Multi-microphone arrays allow for the use of spatial filtering techniques that can greatly improve noise reduction and source separation. However, for speech and audio data, work on noise reduction or separation has focused primarily on one- or two-channel systems. Because of this, databases of multichannel environmental noise are not widely available. DEMAND (Diverse Environments Multi-channel Acoustic Noise Database) addresses this problem by providing a set of 16-channel noise files recorded in a variety of indoor and outdoor settings. The data was recorded using a planar microphone array consisting of four staggered rows, with the smallest distance between microphones being 5 cm and the largest being 21.8 cm. DEMAND is freely available under a Creative Commons license to encourage research into algorithms beyond the stereo setup.
Link to paper. This is the formal publication to refer to when citing DEMAND.
2012
"Demonstration of a plasmonic thermocycler for the amplification of human androgen receptor DNA"
Analyst, Issue 19, 2012, The Royal Society of Chemistry
A plasmonic heating method for the polymerase chain reaction is demonstrated by the amplification of a section of the human androgen receptor gene. The thermocycler has a simple low-cost design, demonstrates excellent temperature stability and represents the first practical demonstration of plasmonic thermocycling.
Paper at RSC Publishing.
"DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments"
INRIA/IRISA Technical Report, October 11, 2012
DEMAND (Diverse Environments Multichannel Acoustic Noise Database) is a set of 16-channel recordings for research into multichannel noise reduction algorithms. It is provided under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
"Joint Entropy-Scalable Coding of Audio Signals"
ICASSP 2012, Kyoto, Japan, March 2012
A fine-grain scalable coding method for audio signals is proposed in which the entropy coding of the quantizer outputs is made scalable. By constructing a Huffman-like coding tree whose internal nodes can be mapped to reconstruction points, we can prune the tree to control the distortion of the quantizer. Our results show that the proposed method improves on existing similar work and significantly outperforms scalable coding based on reconstruction-error quantization as used in practical systems, e.g., MPEG-4 audio.
Paper on IEEE Xplore.
2011
"A Sparse Auditory Envelope Representation with Iterative Reconstruction for Audio Coding"
Ph.D. Thesis, McGill University, April 2011.
Modern audio coding exploits the properties of the human auditory system to
efficiently code speech and music signals. Perceptual domain coding is a branch
of audio coding in which the signal is stored and transmitted as a set of
parameters derived directly from the modeling of the human auditory system.
Often, the perceptual representation is designed such that reconstruction can
be achieved with limited resources but this usually means that some
perceptually irrelevant information is included. In this thesis, we investigate
perceptual domain coding by using a representation designed to contain only the
audible information regardless of whether reconstruction can be performed
efficiently. The perceptual representation we use is based on a multichannel
Basilar membrane model, where each channel is decomposed into envelope and
carrier components. We assume that the information in the carrier is also
present in the envelopes and therefore discard the carrier components. The
envelope components are sparsified using a transmultiplexing masking model and
form our basic sparse auditory envelope representation (SAER).
An iterative reconstruction algorithm for the SAER is presented that estimates
carrier components to match the encoded envelopes. The algorithm is split into
two stages. In the first, two sets of envelopes are generated, one of which
expands the sparse envelope samples while the other provides limits for the
iterative reconstruction. In the second stage, the carrier components are
estimated using a synthesis-by-analysis iterative method adapted from methods
designed for reconstruction from magnitude-only transform coefficients. The
overall system is evaluated using subjective and objective testing on speech
and audio signals. We find that some types of audio signals are reproduced very
well using this method whereas others exhibit audible distortion. We conclude
that, except in some specific cases where part of the carrier information
is required, most of the audible information is present in the SAER and can be
reconstructed using iterative methods.
Update: everything (code and thesis) is now also on GitHub.
before 2011
"Using Salient Envelope Features for Audio Coding"
AES 34th Int. Conference (Jeju Island, Korea), August 2008.
In this paper, we present a perceptual audio coding method that encodes the audio using perceptually salient envelope features. These features are found by passing the audio through a set of gammatone filters, and then computing the Hilbert envelopes of the responses. Relevant points of these envelopes are isolated and transmitted to the decoder. The decoder reconstructs the audio in an iterative manner from these relevant envelope points. Initial experiments suggest that even without sophisticated entropy coding a moderate bitrate reduction is possible while retaining good quality.
Link to paper, link to presentation, please also see supplemental material.
"NeuriteTracer: A novel ImageJ plugin for automated quantification of neurite outgrowth"
Journal of Neuroscience Methods, Volume 168, Issue 1, pp. 134-139, February 2008.
Link to paper. This paper is the result of a collaboration with the group of Alyson Fournier at the Montreal Neurological Institute. For this paper, I assisted with the development and implementation of image processing algorithms for the measurement of neurite outgrowth.
"Reconstructing Audio Signals from Modified Non-Coherent Hilbert Envelopes"
Proc. Interspeech 2007 (Antwerpen, Belgium), pp. 534-537, August 2007.
In this paper, we present a speech and audio analysis-synthesis method based on a Basilar Membrane (BM) model. The audio signal is represented in this method by the Hilbert envelopes of the responses to complex gammatone filters uniformly spaced on a critical band scale. We show that for speech and audio signals, a perceptually equivalent signal can be reconstructed from the envelopes alone by an iterative procedure that estimates the associated carrier for the envelopes. The rate requirement of the envelope information is reduced by low-pass filtering and sampling, and it is shown that it is possible to recover a signal without audible distortion from the sampled envelopes. This may lead to improved perceptual coding methods.
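The envelope extraction at the heart of this representation can be sketched with the FFT-based analytic signal (equivalent to scipy.signal.hilbert followed by taking the magnitude). This sketch shows the Hilbert-envelope step only; the gammatone filterbank and the iterative carrier estimation from the paper are omitted.

```python
import numpy as np

def hilbert_envelope(x):
    """Hilbert envelope of a real signal via the analytic signal:
    zero the negative frequencies, double the positive ones, and take
    the magnitude of the inverse FFT."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(X * h))

# Example: an amplitude-modulated tone; the recovered envelope should
# follow the 4 Hz modulator 1 + 0.5*cos(2*pi*4*t)
fs = 8000
t = np.arange(fs) / fs
env_true = 1.0 + 0.5 * np.cos(2 * np.pi * 4 * t)
x = env_true * np.sin(2 * np.pi * 1000 * t)
env = hilbert_envelope(x)
```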
"Methods and Devices for Audio Compression Based on ACELP/TCX Coding and Multi-Rate Lattice Vector Quantization"
Canadian Patent Application CA 2457988, filed February 2004.
Link to patent. This patent was submitted as part of my work on the AMR-WB+ codec that I did while working for VoiceAge and GRPA at the University of Sherbrooke, under Prof. Roch Lefebvre.
"Low Distortion Acoustic Noise Suppression Using a Perceptual Model for Speech Signals"
Proc. IEEE Workshop Speech Coding (Tsukuba, Japan), pp. 172-174, Oct. 2002.
Algorithms for the suppression of acoustic noise in speech signals are generally Short-Time Spectral Amplitude (STSA) methods such as Spectral Subtraction. These methods have been effective at reducing or removing the background noise, but have a tendency (at low SNR) to add annoying artefacts, such as musical noise, and distortion of the speech signal. By employing an auditory model, psychoacoustic effects such as simultaneous masking can be used to apply spectral modification in a more effective manner, reducing the amount of overall modification necessary. In this way, the artefacts introduced by the processing are reduced. This paper proposes a method to significantly improve the reduction in the background acoustic noise in narrowband and wideband speech signals, even at low SNR. Here we show that the use of a subtraction strategy and psychoacoustic model originally intended for audio signals yields an output signal with little or no audible distortion.
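The basic principle can be sketched per frequency bin: apply spectral subtraction only where the estimated noise exceeds a masking threshold, and leave bins where the noise is already masked by the speech untouched. This is a strongly simplified illustration of the idea, not the paper's method; a real system derives the masking thresholds frame by frame from an auditory model, and the spectral-floor parameter below is invented.

```python
import numpy as np

def masked_spectral_subtraction(noisy_mag, noise_mag, mask_thresh, floor=0.05):
    """Per-bin spectral subtraction gated by a masking threshold.
    noisy_mag, noise_mag: magnitude spectra; mask_thresh: per-bin masking
    thresholds; floor: spectral floor to limit musical noise."""
    # Standard subtraction with a spectral floor
    over = np.maximum(noisy_mag - noise_mag, floor * noisy_mag)
    # Only modify bins where the noise would actually be audible
    audible = noise_mag > mask_thresh
    return np.where(audible, over, noisy_mag)

# Toy demo, three bins: strong speech (noise masked), weak speech
# (noise audible), noise only
noisy = np.array([10.0, 2.0, 1.0])
noise = np.array([0.5, 1.5, 1.0])
mask  = np.array([1.0, 0.8, 0.2])   # hypothetical masking thresholds
out = masked_spectral_subtraction(noisy, noise, mask)
```

Gating the subtraction this way reduces the total amount of spectral modification, which is precisely what limits the musical-noise artefacts that plain spectral subtraction introduces.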
"Noise Suppression using a Perceptual Model for Wideband Speech Signals"
Proc. Biennial Symp. Commun. (Kingston, ON), pp. 516-519, June 2002.
Traditional algorithms for suppressing background noise in speech signals can add annoying artefacts to the resulting denoised signal. In applications requiring better than toll quality, it is desirable that noise suppression should not add any audible artefacts. This paper describes a method that is effective for narrowband and applies these methods to wideband signals. The method presented uses a high-resolution psychoacoustic model originally developed for the evaluation of audio quality, and combines it with a method originally developed for audio signal enhancement. It is shown that while the method works well in narrowband applications, in wideband signals the quality needs to be improved.
"Acoustic Noise Suppression for Speech Signals using Auditory Masking Effects"
M. Eng. Thesis, McGill University, May 2001.
The process of suppressing acoustic noise in audio signals, and speech signals in particular, can be improved by exploiting the masking properties of the human hearing system. These masking properties, where strong sounds make weaker sounds inaudible, are calculated using auditory models. This thesis examines both traditional noise suppression algorithms and ones that incorporate an auditory model to achieve better performance. The different auditory models used by these algorithms are examined. A novel approach, based on a method to remove a specific type of noise from audio signals, is presented using a standardized auditory model. The proposed method is evaluated with respect to other noise suppression methods in the problem of speech enhancement. It is shown that this method performs well in suppressing noise in telephone-bandwidth speech, even at low Signal-to-Noise Ratios.