You need to estimate the impulse response of the speaker and room, which varies with the exact placement of the speaker and microphone and with the size and contents of the room. You also need to know or estimate the system delay.
If the person or the microphone moves, the impulse response and delay must be continuously re-estimated.
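One common way to keep re-estimating the response as things move is an adaptive filter. The sketch below uses normalized LMS (NLMS), which is my choice for illustration and not something the approach above prescribes. The function name `nlms_echo_canceller`, the tap count, the step size, and the synthetic `room` response are all assumptions made for the demo.

```python
import random

def nlms_echo_canceller(reference, mic, num_taps=8, step=0.5, eps=1e-8):
    """Track the echo path sample by sample with NLMS.

    reference: samples sent to the speaker
    mic:       samples captured by the microphone
    Returns the final tap weights and the residual (mic minus estimated echo).
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent reference sample first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate
        # Normalize the update by the recent reference energy.
        norm = sum(h * h for h in history) + eps
        weights = [w + step * error * h / norm
                   for w, h in zip(weights, history)]
        residual.append(error)
    return weights, residual

# Synthetic demo: white-noise playback through a made-up echo path.
random.seed(1)
reference = [random.uniform(-1.0, 1.0) for _ in range(5000)]
room = [0.0, 0.0, 0.5, 0.25, -0.1]  # hypothetical response, 2-sample delay
mic = [sum(room[j] * reference[n - j]
           for j in range(len(room)) if n - j >= 0)
       for n in range(len(reference))]

weights, residual = nlms_echo_canceller(reference, mic)
```

In this noiseless demo the weights settle close to `room` and the residual shrinks toward zero. With real audio, near-end speech disturbs the adaptation, so practical cancellers usually pause or slow the update while the local talker is active.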
Once you have estimated the impulse response, you can convolve it with the output signal and subtract delayed versions of the result from the microphone input, adjusting the delay until the silent portions of the speech input are nulled. Cross-correlation can help with estimating the delay.
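As a rough illustration of these steps, here is a minimal pure-Python sketch on synthetic data. The helper names (`estimate_delay`, `convolve`, `cancel_echo`), the three-tap `true_ir`, and the 25-sample delay are all hypothetical, and the impulse response is assumed to be already known and to have its largest tap first.

```python
import random

def estimate_delay(reference, mic, max_lag):
    """Return the lag (in samples) with the largest cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(r * m for r, m in zip(reference, mic[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def convolve(signal, impulse_response):
    """Direct-form linear convolution."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def cancel_echo(mic, reference, impulse_response, delay):
    """Subtract the delayed, convolved reference from the mic input."""
    echo = convolve(reference, impulse_response)
    cleaned = list(mic)
    for n in range(len(mic)):
        k = n - delay
        if 0 <= k < len(echo):
            cleaned[n] -= echo[k]
    return cleaned

# Synthetic demo: the mic hears only the delayed echo (no near-end speech).
random.seed(0)
reference = [random.uniform(-1.0, 1.0) for _ in range(2000)]
true_ir = [0.6, 0.3, -0.1]   # assumed impulse response
true_delay = 25              # assumed system delay in samples
echo = convolve(reference, true_ir)
mic = [echo[n - true_delay] if n >= true_delay else 0.0
       for n in range(len(reference))]

delay = estimate_delay(reference, mic, max_lag=100)
residual = cancel_echo(mic, reference, true_ir, delay)
```

Here the mic signal contains nothing but echo, so a correct delay and impulse response leave a residual of zero. With real recordings the same check would be done on stretches where the local talker is silent.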