I have been working on this very problem (in C, though) for a while, and every time I think I have it solved, the Internet gets busy or the conditions change somehow and boom! Intermittent audio again. Well, I'm pretty sure I've got it licked now.
Using the algorithm below, I get really good sound quality. I've compared it to other softphones running under the same network conditions, and it performs noticeably better.
The first thing I do is try to determine whether the PBX or other SIP proxy we are connecting to is on the same local network as the UA (softphone) or not.
If so, I define my jitterbuffer as 100 ms; if not, I use 200 ms. That way I limit my latency when I can; even 200 ms does not cause noticeable conversational problems or talk-over.
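As a minimal sketch of that decision (the helper name, the IPv4-only assumption, and the idea that "local" means "same subnet" are all mine, not part of the original code):

```c
#include <stdint.h>

/* Hypothetical helper: pick the jitter-buffer depth in ms based on whether
 * the proxy's IPv4 address is on the same subnet as the softphone.
 * 'addr', 'local', and 'mask' are host-order 32-bit IPv4 values. */
static int pick_jitterbuffer_ms(uint32_t addr, uint32_t local, uint32_t mask)
{
    /* Same subnet -> short 100 ms buffer; otherwise a safer 200 ms. */
    return ((addr & mask) == (local & mask)) ? 100 : 200;
}
```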
So. Then I use whatever millisecond-precision system counter is available (on Windows, GetTickCount64()) to record when the first packet arrived for playback. Let's call that variable "x".
Then, when ((GetTickCount64() - x) > jitterbuffer) becomes true, I start playing back the buffered frames.
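The gate itself is trivial; written as a standalone helper (the function name is mine, and 'now_ms' stands in for GetTickCount64() so the logic can be exercised without the Windows clock):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the startup gate: hold playback until 'jitterbuffer' ms have
 * elapsed since the first packet arrived at time 'x' (both in ms). */
static bool playback_may_start(uint64_t now_ms, uint64_t x, uint64_t jitterbuffer)
{
    return (now_ms - x) > jitterbuffer;
}
```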
That part is a straightforward fixed-length jitter buffer. Here's the tricky bit.
While decoding an RTP frame (for example, from muLaw to PCM) and buffering it for playback, I calculate the AVERAGE ABSOLUTE amplitude of the audio frame and save it along with the frame.
I do this with this structure:
```c
typedef struct tagCCONNECTIONS {
    char binuse;
    struct sockaddr_in client;
    SOCKET socket;
    unsigned short media_index;
    UINT32 media_ts;
    long ssrc;
    unsigned long long lasttimestamp;
    int frames_buffered;
    int buffer_building;
    int starttime;
    int ssctr;
    struct {
        short pcm[160];
    } jb[AUDIO_BUFFER];
    char jbstatus[AUDIO_BUFFER];
    char jbsilence[AUDIO_BUFFER];
    int jbr, jbw;
    short pcms[160];
    char status;
    PCMS *outraw;
    char *signal;
    WAVEHDR *preparedheaders;
    DIALOGITEM *primary;
    int readptr;
    int writeptr;
} CCONNECTIONS;
```
OK, pay attention to the member tagCCONNECTIONS::jbsilence[AUDIO_BUFFER]. For each decoded audio frame in tagCCONNECTIONS::jb[x].pcm[] there is a corresponding flag telling us whether that frame is audible or not.
Also...
```c
#define READY 1
#define EMPTY 0
```
The tagCCONNECTIONS::jbstatus[AUDIO_BUFFER] field tells us whether the specific audio frame we are about to play is READY or EMPTY. In the theoretical event of a buffer underrun it MAY be EMPTY, in which case we would normally wait for it to become READY, then begin playback...
Now, in the part of my program that plays audio, I have two main functions: one called pushframe() and one called popframe().
My thread that maintains the network connection and receives RTP calls pushframe(), which converts muLaw to PCM, computes the frame's AVERAGE ABSOLUTE amplitude, marks the frame as silent if it is inaudible, and marks ::jbstatus[x] as READY.
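A minimal sketch of that decode-and-classify step for one 20 ms G.711 frame (the μ-law expansion follows the standard G.711 algorithm, and the 200-count threshold matches the one described later; the helper names are mine, not from the original program):

```c
#include <stdlib.h>

#define SAMPLES_PER_FRAME 160  /* 20 ms at 8 kHz */
#define SILENCE_THRESHOLD 200  /* average-absolute-amplitude cutoff */

/* Standard G.711 mu-law to 16-bit linear PCM expansion. */
static short mulaw_decode(unsigned char u)
{
    u = (unsigned char)~u;
    int sign     = u & 0x80;
    int exponent = (u >> 4) & 0x07;
    int mantissa = u & 0x0F;
    int sample   = (((mantissa << 3) + 0x84) << exponent) - 0x84;
    return (short)(sign ? -sample : sample);
}

/* Decode one frame into 'pcm', compute its average absolute amplitude,
 * and report whether it counts as silence (1 = silent, 0 = audible). */
static int decode_and_classify(const unsigned char *mulaw, short *pcm)
{
    long total = 0;
    for (int i = 0; i < SAMPLES_PER_FRAME; i++) {
        pcm[i] = mulaw_decode(mulaw[i]);
        total += labs((long)pcm[i]);
    }
    return (total / SAMPLES_PER_FRAME) < SILENCE_THRESHOLD;
}
```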
Then, in my audio thread, we first check whether the jitter-buffer delay has elapsed:
```c
if ( ( GetTickCount64() - x ) > jitterbuffer ) { ... }
```
Then we check whether the next frame to be played is READY (meaning it has actually been filled).
Then we check whether the frame AFTER that one is also READY, and whether it is AUDIBLE OR SILENT!
*** IMPORTANT
Basically, we know that a 200 ms jitter buffer holds ten 20 ms audio frames.
If, at any point after the initial 200 ms jitter-buffer delay (the audio saved up at the start), the number of queued audio frames drops below 10 (or jitterbuffer / 20), we enter what I call "buffer_building" mode. In that mode, if the next audio frame we are about to play is silent, we tell the program that the jitter buffer is not full yet, that it is still 20 ms short of full; but since that NEXT frame is silent, we simply do not play it and use the silence period to wait for incoming frames to replenish our buffer:
```c
tagCCONNECTIONS::lasttimestamp = GetTickCount64() - (jitterbuffer - 20);
```
This inserts a period of complete silence during what would have been "perceived" silence anyway, but allows the buffer to replenish itself. Then, once I'm back to the full 10 frames, I exit buffer_building mode and just play the audio.
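Put together, the replenishment rule might look like this sketch (the field names follow the struct above, but the standalone struct, the ring-buffer arithmetic on jbr/jbw, and the helper name are my assumptions; the caller is expected to advance jbr whether the frame is played or skipped):

```c
#define AUDIO_BUFFER 24
#define READY 1
#define EMPTY 0
#define FRAME_MS 20

struct jb_state {
    char jbstatus[AUDIO_BUFFER];
    char jbsilence[AUDIO_BUFFER];
    int  jbr, jbw;           /* read / write indices into the ring */
    int  buffer_building;    /* 1 while we are letting the buffer refill */
    int  jitterbuffer_ms;    /* 100 or 200 */
};

/* Decide whether the next frame should actually be played.
 * Returns 1 to play jb[s->jbr], 0 to output silence instead. */
static int should_play_next(struct jb_state *s)
{
    int queued = (s->jbw - s->jbr + AUDIO_BUFFER) % AUDIO_BUFFER;
    int target = s->jitterbuffer_ms / FRAME_MS;   /* e.g. 200/20 = 10 */

    if (s->jbstatus[s->jbr] != READY)
        return 0;                 /* underrun: nothing decoded yet */

    if (queued < target)
        s->buffer_building = 1;   /* we fell behind, start refilling */
    else
        s->buffer_building = 0;   /* back to a full buffer */

    /* While refilling, spend silent frames on catching up instead of
     * playing them: output nothing and let the network top us up. */
    if (s->buffer_building && s->jbsilence[s->jbr])
        return 0;

    return 1;
}
```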
I enter "buffer_building" mode even if we are only one frame short of a full buffer, because a long-winded talker may leave very little silence to exploit. That can deplete the buffer quickly even while in "buffer_building" mode.
Now... "What is silence?" I hear you ask. In my implementation, silence is hard-coded as any frame whose AVERAGE ABSOLUTE 16-bit PCM amplitude is less than 200. I compute it as follows:
```c
int total_pcm_val = 0;
for (i = 0; i < 160; i++) {
    total_pcm_val += ABS(cc->jb[jbn].pcm[i]);
}
total_pcm_val /= 160;
if (total_pcm_val < 200) {
    cc->jbsilence[jbn] = 1;
}
```
Eventually I intend to keep a running average amplitude for the connection and play around with the threshold: perhaps if the current frame's amplitude is 5% or less of that running average, we consider it silent; or maybe 2%. I don't know yet, but that way, if there is a lot of wind or background noise, the definition of "silence" can adapt. I have to play with it, but I believe this is the key to replenishing the jitter buffer.
Do it when there is no important audio to hear, and keep the real information (their voice) crystal clear.
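As a sketch of that adaptive cutoff (untested on my side; the 5% figure comes from the text above, but the exponential running average and the 1/16 smoothing weight are arbitrary choices of mine):

```c
/* Running average of frame amplitudes, updated once per decoded frame.
 * A frame counts as 'silent' if its amplitude is below 5% of the average. */
struct silence_detector {
    long avg_amplitude;   /* exponentially smoothed average-absolute level */
};

static int is_silent_adaptive(struct silence_detector *d, long frame_amplitude)
{
    if (d->avg_amplitude == 0)
        d->avg_amplitude = frame_amplitude;   /* seed on the first frame */
    else
        d->avg_amplitude += (frame_amplitude - d->avg_amplitude) / 16;

    return frame_amplitude < (d->avg_amplitude * 5) / 100;
}
```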
Hope this helps. I'm a bit scattered when it comes to explaining things, but I am very, very pleased with how my VoIP application sounds.