Speaker Recognition

How can I distinguish between two people speaking? For example, if one person says "hello" and then another person says "hello", what signature should I look for in the audio data? Periodicity?

Thanks in advance to anyone who can answer this.

4 answers

The solution to this problem lies in digital signal processing (DSP). Speaker recognition is a complex problem that brings computing and communications technology together. Most speaker identification methods combine signal processing with machine learning (training on a database of speakers, then identifying a speaker using the trained model). An outline of one possible algorithm:

  • Record the audio in a raw format. This is the digital signal that needs to be processed.
  • Apply some pre-processing to the captured signal. This can simply be normalization, or filtering to remove noise (using a bandpass filter covering the normal frequency range of the human voice; a bandpass filter can in turn be built by combining a low-pass and a high-pass filter).
  • Once you are reasonably confident that the captured signal is substantially free of noise, the feature-extraction stage begins. Some well-known methods for extracting voice features are Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), or a simple FFT.
  • From here there are two phases: training and testing.
  • First, the system must be trained on the voice features of the different speakers before it can distinguish between them. To ensure the features are estimated reliably, it is recommended to collect several (> 10) voice samples per speaker for training.
  • Training can be performed with various methods, such as neural networks or distance-based classification, to find the differences between the feature vectors of different speakers.
  • In the testing stage, the training data is searched for the set of voice features that lies at the smallest distance from the signal under test. Various distance measures can be used for this, for example the Euclidean or Chebyshev distance.
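The pipeline above (features, training by averaging, nearest-distance testing) can be sketched in a few lines. This is a toy illustration only: the "speakers" are synthetic harmonic tones at made-up fundamental frequencies, the sample rate and speaker names are assumptions, and the features are a crude pooled FFT magnitude rather than real MFCCs.

```python
import numpy as np

SR = 8000  # sample rate in Hz (assumed for this synthetic example)

def make_utterance(f0, seconds=1.0, rng=None):
    """Synthesize a crude stand-in for a voice: a few harmonics plus noise."""
    rng = rng if rng is not None else np.random.default_rng(0)
    t = np.arange(int(SR * seconds)) / SR
    signal = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 4))
    return signal + 0.05 * rng.standard_normal(t.size)

def features(x, n_bins=64):
    """Simple FFT-based features: log magnitude spectrum pooled into bins."""
    mag = np.abs(np.fft.rfft(x))
    pooled = mag[: (mag.size // n_bins) * n_bins].reshape(n_bins, -1).mean(axis=1)
    return np.log1p(pooled)

# Training: average the features of several (> 10) utterances per speaker.
rng = np.random.default_rng(42)
speakers = {"alice": 120.0, "bob": 220.0}  # hypothetical fundamental frequencies
models = {
    name: np.mean([features(make_utterance(f0, rng=rng)) for _ in range(10)], axis=0)
    for name, f0 in speakers.items()
}

# Testing: identify a new utterance by smallest Euclidean distance to a model.
def identify(x):
    feats = features(x)
    return min(models, key=lambda name: np.linalg.norm(models[name] - feats))

print(identify(make_utterance(120.0, rng=rng)))  # expected: alice
```

A real system would replace the pooled FFT with MFCCs and the averaged template with a proper statistical model, but the train-then-nearest-match structure is the same.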

There are two open-source packages for speaker identification: ALIZE: http://mistral.univ-avignon.fr/index_en.html and MARF: http://marf.sourceforge.net/ .

I know I'm a little late in answering this question, but I hope someone finds it useful.


This is an extremely difficult problem, even for professionals in the field of speech and signal processing. There is much more information on this page: http://en.wikipedia.org/wiki/Speaker_recognition

And some pointers to the technologies involved:

Various technologies used for processing and storing voice prints include frequency estimation, hidden Markov models, Gaussian mixture models, pattern-matching algorithms, neural networks, matrix representation, vector quantization, and decision trees. Some systems also use anti-speaker techniques, such as cohort models and world models.
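Of the techniques listed, vector quantization is one of the simplest to demonstrate: each speaker is modeled by a small k-means codebook over their feature frames, and a test utterance is assigned to the speaker whose codebook quantizes it with the least distortion. A minimal sketch, using made-up 2-D "feature frames" and hypothetical speaker names:

```python
import numpy as np

def kmeans_codebook(frames, k=4, iters=20, seed=0):
    """Train a tiny VQ codebook (plain k-means) over one speaker's frames."""
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest codeword, then recompute centroids.
        d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = frames[labels == j].mean(axis=0)
    return codebook

def vq_distortion(frames, codebook):
    """Mean distance from frames to the nearest codeword: lower = better match."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

# Toy 2-D feature frames for two speakers, clustered in different regions.
rng = np.random.default_rng(2)
alice = rng.normal([0.0, 0.0], 0.3, size=(200, 2))
bob = rng.normal([3.0, 3.0], 0.3, size=(200, 2))
books = {"alice": kmeans_codebook(alice), "bob": kmeans_codebook(bob)}

test = rng.normal([0.0, 0.0], 0.3, size=(50, 2))  # unseen frames from "alice"
best = min(books, key=lambda n: vq_distortion(test, books[n]))
print(best)  # expected: alice
```

Real systems use the same idea over MFCC frames, typically with much larger codebooks, or replace the codebook with a Gaussian mixture model.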


Having only two people to distinguish, especially if they pronounce the same word or phrase, makes this a lot easier. I suggest starting with something simple and adding complexity only if necessary.

To get started, I would try simple measures of the digitized signal in time and amplitude, or (if your software supports it) an FFT of the entire utterance. Then I would feed those into a basic modeling process, such as a linear discriminant (or whatever you already have).


Another way is to use an array of microphones and distinguish the voice sources by their positions and directions. I think this is a simpler approach, since computing a position is much less complicated than separating different speakers in a mono or stereo source.
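The core of that approach is estimating the time difference of arrival (TDOA) between two microphones by cross-correlation, then converting the delay to a bearing. A minimal sketch with a simulated source (the sample rate, mic spacing, and 5-sample delay are all assumed values):

```python
import numpy as np

SR = 16000            # sample rate in Hz (assumed)
MIC_SPACING = 0.2     # metres between the two microphones (assumed)
SPEED_OF_SOUND = 343.0

def tdoa(sig_left, sig_right):
    """Estimate the delay of sig_left relative to sig_right, in samples,
    from the peak of their full cross-correlation."""
    corr = np.correlate(sig_left, sig_right, mode="full")
    return np.argmax(corr) - (len(sig_right) - 1)

# Simulate a source nearer the right mic: the left copy lags by 5 samples.
rng = np.random.default_rng(1)
src = rng.standard_normal(1024)
delay = 5
left = np.concatenate([np.zeros(delay), src])[: src.size]
right = src.copy()

lag = tdoa(left, right)  # recovers the 5-sample delay
# Far-field bearing from the delay: sin(angle) = lag * c / (SR * spacing).
angle = np.degrees(np.arcsin(np.clip(lag / SR * SPEED_OF_SOUND / MIC_SPACING, -1, 1)))
print(lag, round(angle, 1))
```

With two or more such mic pairs the bearings can be intersected to get a position, and each speaker is then tracked by direction rather than by voice characteristics.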


Source: https://habr.com/ru/post/978564/
