First I'll state in general terms what I'm trying to do and ask for advice. Then I'll explain my current approach and ask for help with the problems I've run into.
Problem
I have an MP3 file of a person speaking. I'd like to split it into segments that roughly correspond to a sentence or phrase. (I'd do it manually, but we are talking about hours of data.)
If you have tips on how to do this programmatically, or know of existing utilities for it, I'd love to hear them. (I'm aware of voice activity detection and have looked into it a bit, but I didn't find any freely available utilities.)
Current approach
I thought the simplest approach would be to scan the MP3 at regular intervals and identify the places where the average volume falls below some threshold. Then I would use an existing utility to cut the MP3 at those places.
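To make the idea concrete, here is a minimal sketch of the scanning step. It assumes I already have one volume reading (in dB) per fixed-length window; the function names, the -40 dB threshold, and the minimum pause length are all placeholders I made up, not values from any library.

```python
def find_quiet_spans(levels_db, window_seconds,
                     threshold_db=-40.0, min_quiet_seconds=0.3):
    """Return (start, end) times, in seconds, of stretches where every
    window's level stays below threshold_db for at least min_quiet_seconds.

    levels_db: one volume reading per window, in dB (assumed input).
    threshold_db and min_quiet_seconds are illustrative defaults only.
    """
    spans = []
    start = None  # index of the first window in the current quiet run
    for i, level in enumerate(levels_db):
        if level < threshold_db:
            if start is None:
                start = i
        elif start is not None:
            # The quiet run ended at window i; keep it if it is long enough.
            if (i - start) * window_seconds >= min_quiet_seconds:
                spans.append((start * window_seconds, i * window_seconds))
            start = None
    # Handle a quiet run that lasts until the end of the recording.
    if start is not None:
        end = len(levels_db)
        if (end - start) * window_seconds >= min_quiet_seconds:
            spans.append((start * window_seconds, end * window_seconds))
    return spans


def cut_points(spans):
    """Pick the midpoint of each quiet span as the place to cut."""
    return [(start + end) / 2 for start, end in spans]
```

Cutting at the midpoint of each pause, rather than at its start, should leave a little silence on both sides of every segment. The threshold would presumably need tuning per recording.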
I've been playing with pymad, and I believe I've successfully extracted the PCM (pulse-code modulation) data for each frame of the MP3. Now I'm stuck, because I can't wrap my head around how the PCM data translates to relative volume. I'm also aware of other complicating factors, such as multiple channels, big-endian vs. little-endian byte order, etc.
Advice on how to map a group of PCM samples to a relative volume would be key.
Thanks!
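For reference, my current understanding is that one common measure is the root mean square (RMS) of the samples, expressed in decibels relative to full scale (dBFS). Below is a minimal sketch of that, assuming the decoder hands back 16-bit signed samples as raw bytes with channels interleaved. That format is my assumption and I haven't confirmed it against pymad; `rms_dbfs` is a name I made up.

```python
import array
import math
import sys


def rms_dbfs(pcm_bytes, big_endian=False):
    """Return the RMS level of 16-bit signed PCM in dBFS.

    0 dBFS is roughly full scale; quieter audio gives more negative values,
    and digital silence gives -inf. Assumes 2 bytes per sample (an
    assumption about the decoder output, not something checked here).
    """
    samples = array.array("h")  # signed 16-bit on mainstream platforms
    usable = len(pcm_bytes) - (len(pcm_bytes) % samples.itemsize)
    samples.frombytes(pcm_bytes[:usable])  # drop a trailing partial sample
    # Swap bytes when the data's byte order differs from this machine's.
    if big_endian != (sys.byteorder == "big"):
        samples.byteswap()
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    # 32768 is the magnitude of the largest 16-bit sample.
    return 20 * math.log10(math.sqrt(mean_square) / 32768.0)
```

With interleaved stereo this averages over both channels together, which seems good enough for spotting pauses. Calling it once per chunk of samples would give one reading per interval to compare against a threshold.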