Writing software to detect where a sound is coming from (directional listening)

I have been curious about this for some time, so I thought maybe I could get some good answers here.

What I know so far:

People can use their two ears to figure out not only what a sound "sounds like" but also where it is coming from. Pitch is the note we hear, and something like a human voice has various overtones (it is not a pure tone).

What I would like to know:

How do I start writing a program that can work out where a sound is coming from? Theoretically, I would need two microphones; I would then record the audio arriving at each microphone and store it so that each split second of audio is kept as a pair, for example [streamA, streamB].

I feel that there may be a formulaic/mathematical way of calculating, from the audio data, where the sound comes from. I also feel that you could take the streaming data and train a learner (give it a sound sample and tell it where the sound came from) and have it classify incoming sound that way.

What is the best way to do this / are there good resources from which I can learn more about the subject?

EDIT:

Example:

                 front

left (mic) x ======== x (mic) right

                 back              x   (sound source; should return "back", "right", or "back right")

I want to write a program that can return front/back and left/right for most of the sound it hears. From what I understand, it should be easy to set up two microphones and label one direction as "front". Based on that, I am trying to understand how we can triangulate the sound and work out where the source is relative to the microphones.

+6
4 answers

If you look into research papers on phased microphone arrays, particularly those used for underwater direction finding (a huge area of submarine research during the Cold War: where is that engine noise coming from, so we can aim our torpedoes?), then you will find the techniques and math needed to locate a sound given two or more microphone inputs.

It is not trivial, and not something that can be covered broadly enough here, so you will not find a simple code snippet and/or library that does what you need.

The main problem is eliminating echoes and shadows. The simplest approach would be to start with a single tone, filter out everything except that tone, and then measure the phase difference between the two microphones for that tone. The phase difference will give you a lot of information about the tone's location.
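
To make the single-tone idea concrete (my own illustration, not code from this answer), here is a minimal numpy sketch with a made-up tone frequency, delay, and sample rate; it estimates the inter-microphone delay from the phase difference at the tone's frequency bin:

import numpy as np

fs = 44100          # sample rate in Hz (assumed)
f0 = 1000.0         # the single tone we filtered down to, in Hz (made up)
delay = 2e-4        # true inter-microphone delay in seconds (made up)

n = 2205            # 0.05 s of audio, an exact number of tone periods
t = np.arange(n) / fs
mic_a = np.sin(2 * np.pi * f0 * t)
mic_b = np.sin(2 * np.pi * f0 * (t - delay))   # same tone, arriving later

# compare the phase of the two signals at the tone's frequency bin
spec_a = np.fft.rfft(mic_a)
spec_b = np.fft.rfft(mic_b)
bin_f0 = int(round(f0 * n / fs))

phase_diff = np.angle(spec_a[bin_f0]) - np.angle(spec_b[bin_f0])
# wrap into (-pi, pi] and convert the phase back to a time difference
phase_diff = (phase_diff + np.pi) % (2 * np.pi) - np.pi
estimated_delay = phase_diff / (2 * np.pi * f0)

print(estimated_delay)   # should come out close to 2e-4 s

Note that a single tone's phase is ambiguous once the true delay exceeds half its period, which is one reason to start with a low tone or move on to cross-correlation.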

Then you can choose whether you want to deal with echoes and multipath issues (many of which can be eliminated by removing everything except the strongest tone) or move on to correlating sounds that consist of more than one tone: a person talking, or glass breaking, for example. Start small and easy, and expand from there.

+5

I was looking into something similar and wrote a half-baked answer here that got deleted. I had some ideas but had not really written them up properly. The deletion gave my internet pride the kind of bruise that spurs the ego, so I decided to actually try the problem, and I think it worked!

Actually, doing what Adam Davis's answer describes in full is very hard, but doing a human-style localization (latching onto the first-arriving source and ignoring echoes, or treating them as separate sources) is not so bad, I think, although I am by no means a signal-processing specialist.

I read this and this. They made me realize that the core problem is finding the time shift between the two signals (cross-correlation). From there you can calculate the angle using the speed of sound. Note that you will get two solutions (front and back are ambiguous).

The key information I read was in this answer and others on the same page, which talk about how to use the fast Fourier transform in scipy to find the cross-correlation curve.
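
Before the full script below, here is a small self-contained sketch (my own, with made-up numbers) showing the core idea: recover a known shift between two channels with scipy's fftconvolve, then convert the delay to an angle using the speed of sound and an assumed microphone spacing.

import numpy as np
from scipy.signal import fftconvolve

fs = 44100            # sample rate in Hz (assumed)
mic_distance = 0.2    # metres between the microphones (made up)
true_shift = 20       # the right channel lags the left by 20 samples

rng = np.random.default_rng(0)
left = rng.standard_normal(5000)
right = np.roll(left, true_shift)   # a crude stand-in for a delayed copy

# convolving left with the reversed right channel is a cross-correlation
cor = fftconvolve(left, right[::-1], mode='full')
lag = np.argmax(np.abs(cor)) - (len(right) - 1)

delay = lag / fs                              # seconds
sina = np.clip(delay * 340.0 / mic_distance, -1.0, 1.0)
angle = np.degrees(np.arcsin(sina))           # 0 = straight ahead

print(lag, angle)   # lag comes out as -20 under this sign convention
                    # (right lags left by 20 samples), angle about -50 degrees

With a 0.2 m spacing the physically possible delay is only about 26 samples at 44.1 kHz, which is why the clip (and the echo check mentioned below) matters.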

Basically, you need to import the wave file into Python; see the standard wave module used in the code below.

If your wave file (the input) is a tuple of two numpy arrays (left, right), each zero-padded to at least its own length (apparently so that the correlation does not align circularly), the code follows from Gustavo's answer. I think you have to accept that the FFT assumes time invariance, which means that if you want any time-resolved tracking of the signal you need to "bite off" small chunks of data.

I have put together the following code from the sources mentioned above. For each chunk it plots the angle estimated from the time delay, in frames, between left and right (negative/positive). To convert a delay in frames to actual time, divide by the sampling rate. If you want to know the angle, you need to:

  • Assume everything is on a plane (no height factor)
  • Forget the difference between sound in front and sound behind (you cannot distinguish them)

You would also want to use the distance between the two microphones to make sure you are not picking up echoes (delays longer than the 90-degree delay would allow).

I realize I have borrowed a lot here, so thanks to everyone who inadvertently contributed!

import wave
import struct
from math import asin, pi

from numpy import array, concatenate, argmax
from numpy import abs as nabs
from scipy.signal import fftconvolve
from matplotlib.pyplot import plot, show


def crossco(wav):
    """Returns the cross-correlation of the left and right audio.

    It uses a convolution of left with the reversed right channel,
    which is equivalent to a cross-correlation.
    """
    cor = nabs(fftconvolve(wav[0], wav[1][::-1]))
    return cor


def trackTD(fname, width, chunksize=5000):
    """Plot the estimated angle of the dominant source per chunk.

    fname is a stereo wave file, width is the distance between the
    two microphones in metres, chunksize is the number of frames
    analysed per estimate.
    """
    track = []
    v = 340.0  # speed of sound in m/s

    # open the wave file using python's built-in wave library
    wav = wave.open(fname, 'r')

    # get the info from the file
    (nchannels, sampwidth, framerate, nframes, comptype, compname) = wav.getparams()

    # only loop while there are enough whole chunks left in the file
    while wav.tell() < nframes - chunksize:
        # read the audio frames as a sequence of bytes
        frames = wav.readframes(chunksize)

        # unpack that byte sequence into 16-bit samples
        out = struct.unpack_from("%dh" % (chunksize * nchannels), frames)

        # convert the 2 channels to numpy arrays
        if nchannels == 2:
            # the left channel is the even-numbered samples
            left = array(out[0::2])
            # the right channel is the odd-numbered samples
            right = array(out[1::2])
        else:
            left = array(out)
            right = left

        # zero-pad each channel with as many zeroes as it has samples,
        # so the correlation does not align circularly
        left = concatenate((left, [0] * chunksize))
        right = concatenate((right, [0] * chunksize))
        chunk = (left, right)

        # if the volume is very low (800 or less), assume 0 degrees
        if nabs(left).max() < 800:
            a = 0.0
        else:
            # otherwise compute the delay, in frames, for this chunk
            # (zero lag sits at index 2*chunksize - 1 of the correlation)
            cor = argmax(crossco(chunk)) - (2 * chunksize - 1)

            # convert the delay to seconds
            t = cor / float(framerate)

            # get the angle, clamping to avoid domain errors from
            # delays longer than the microphone spacing allows
            sina = max(-1.0, min(1.0, t * v / width))
            a = asin(sina) * 180 / pi

        # add this chunk's angle estimate to the track
        track.append(a)

    # plot the list of angles
    plot(track)
    show()

I tried this using some stereo sound that I found on Equilogy. I used the car example (a stereo file). It produced this.

To do this on the fly, I think you would need an incoming stereo source that you "listen" to for a short time (I used 1000 frames = 0.0208 s), then calculate, and repeat.
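
As a hedged sketch of such an on-the-fly loop (my own addition, not part of this answer): it assumes the pyaudio library for stereo capture, made-up values for the sample rate, chunk size and microphone spacing, and reuses the same per-chunk cross-correlation logic as the script above.

import struct
import numpy as np
from scipy.signal import fftconvolve
import pyaudio   # assumption: pyaudio is installed; any stereo capture API would do

RATE = 44100
CHUNK = 1024           # about 0.023 s of audio per estimate
MIC_DISTANCE = 0.2     # metres between microphones (made up)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=2, rate=RATE,
                input=True, frames_per_buffer=CHUNK)

try:
    while True:
        data = stream.read(CHUNK)
        samples = np.array(struct.unpack_from("%dh" % (2 * CHUNK), data))
        left, right = samples[0::2], samples[1::2]

        # same per-chunk logic as the script above: cross-correlate,
        # convert the peak lag to a delay, then to an angle
        cor = fftconvolve(left, right[::-1])
        lag = np.argmax(np.abs(cor)) - (len(right) - 1)
        sina = np.clip(lag / RATE * 340.0 / MIC_DISTANCE, -1.0, 1.0)
        print(np.degrees(np.arcsin(sina)))
except KeyboardInterrupt:
    stream.stop_stream()
    stream.close()
    p.terminate()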

[edit: I found that you can simply use scipy's fftconvolve with the time-reversed series of one of the two signals to compute the correlation]

+3

This is an interesting problem. I do not know of any reference material for it, but I have some experience in audio and signal processing that might help point you in the right direction.

Determining the direction a sound is coming from (relative to you) is pretty simple. Get six directional microphones and point them up, down, front, back, left, and right. By looking at the relative amplitudes of the microphone signals in response to a sound, you can pretty easily determine which direction a particular sound is coming from. Increase the number of microphones to increase the resolution.
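
This answer does not include code; as a rough sketch of the amplitude-comparison idea (my own illustration, with made-up data), you could simply pick the direction whose microphone shows the highest RMS level.

import numpy as np

def loudest_direction(buffers):
    """buffers: dict mapping a direction name to a numpy array of samples
    from the directional microphone pointing that way (assumed setup)."""
    rms = {name: np.sqrt(np.mean(np.asarray(samples, dtype=float) ** 2))
           for name, samples in buffers.items()}
    return max(rms, key=rms.get)

# made-up example: six microphones, with the "left" one picking up the most energy
rng = np.random.default_rng(1)
buffers = {d: 0.1 * rng.standard_normal(4410)
           for d in ("up", "down", "front", "back", "left", "right")}
buffers["left"] *= 10
print(loudest_direction(buffers))   # -> "left"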

Two microphones will only tell you whether the sound is to the left or to the right. The reason your two ears can tell whether a sound is coming from in front of or behind you is that the outer structure of your ear modifies the sound depending on its direction, and your brain interprets and corrects for that.

+2

Cross-correlation is the main method, but it has some subtleties. There are various approaches that help you detect a source efficiently with a microphone array. Some of them work without calibration; some require calibration to adapt to the geometry of the room.
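
The answer does not name specific algorithms, but one widely used refinement of plain cross-correlation (my addition, for illustration) is GCC-PHAT, which normalizes the cross-spectrum so that only phase information drives the correlation peak; here is a minimal numpy sketch on made-up data.

import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the delay of sig relative to ref, in seconds,
    using the generalized cross-correlation with phase transform."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-15          # keep only the phase
    cc = np.fft.irfft(cross, n=n)
    shift = np.argmax(np.abs(cc))
    if shift > n // 2:                      # wrap negative lags
        shift -= n
    return shift / fs

# made-up test: a delayed, noisy copy of a random signal
fs = 44100
rng = np.random.default_rng(2)
ref = rng.standard_normal(4096)
sig = np.roll(ref, 25) + 0.1 * rng.standard_normal(4096)
print(gcc_phat(sig, ref, fs) * fs)          # should come out close to 25 samples

The phase transform tends to sharpen the correlation peak in reverberant rooms, which is one reason many array toolkits build on it.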

You can also try existing open-source software for the sound source localization task:

ManyEars sound source localization and separation for robots: https://sourceforge.net/projects/manyears/

HARK toolkit for robotics applications: http://www.ros.org/wiki/hark

+2

Source: https://habr.com/ru/post/904643/

