Since with bird song the “modulation frequency” is likely to be much lower than the “carrier frequency”, even when the amplitude changes rapidly, you can approximate the envelope by taking the absolute value of your signal and then applying a moving average filter with a length of 20 ms.
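To make the rectify-and-average idea concrete, here is a minimal sketch using only the standard library. The function name, the trailing (causal) window and the synthetic 3 kHz test tone with a 5 Hz modulation are my own assumptions for illustration, not taken from your data.

```python
import math

def moving_average_envelope(signal, sample_rate, window_seconds=0.020):
    """Rectify the signal, then smooth it with a trailing moving average."""
    window = max(1, round(window_seconds * sample_rate))
    envelope = []
    running_sum = 0.0
    for i, sample in enumerate(signal):
        running_sum += abs(sample)
        if i >= window:
            # Drop the sample that just left the window.
            running_sum -= abs(signal[i - window])
        envelope.append(running_sum / min(i + 1, window))
    return envelope

# Hypothetical test signal: a 3 kHz "whistle" whose amplitude varies at 5 Hz.
sample_rate = 44100
signal = [
    (0.5 + 0.5 * math.sin(2 * math.pi * 5 * n / sample_rate))
    * math.sin(2 * math.pi * 3000 * n / sample_rate)
    for n in range(sample_rate)
]
envelope = moving_average_envelope(signal, sample_rate)
```

The result follows the shape of the modulation, but its level is lower than the true amplitude, because the average of a rectified sine is below its peak.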
But wouldn't you also be interested in the frequency variations, in order to characterize the song adequately? In that case, a Fourier transform over a moving window will give you much more information, namely the approximate frequency content as a function of time. That is what we humans hear, and it is what helps us distinguish between species of birds.
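A rough sketch of the moving-window Fourier transform, reduced to tracking the strongest frequency in each frame. The frame length, hop size, Hann window and the two-tone test signal are assumptions chosen to keep the example small; a real analysis would normally use an FFT library rather than this plain DFT.

```python
import cmath
import math

def dominant_frequency_track(signal, sample_rate, frame_length=128, hop=64):
    """Return (time_s, dominant_frequency_hz) for each windowed frame."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_length)
              for n in range(frame_length)]
    track = []
    for start in range(0, len(signal) - frame_length + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_length)]
        # Plain DFT magnitudes for bins 0 .. frame_length / 2.
        magnitudes = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_length)
                    for n in range(frame_length)))
            for k in range(frame_length // 2 + 1)
        ]
        peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
        track.append((start / sample_rate, peak_bin * sample_rate / frame_length))
    return track

# Hypothetical test signal: 1 kHz for the first quarter second, then 2 kHz.
sample_rate = 8000
signal = (
    [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(2000)]
    + [math.sin(2 * math.pi * 2000 * n / sample_rate) for n in range(2000)]
)
track = dominant_frequency_track(signal, sample_rate)
```

Keeping all the magnitudes per frame, instead of only the peak bin, gives the usual spectrogram.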
I don’t have access to the link you sent me:
"There is no problem with the prevailing Borradors."
If you do not want attenuation, you should use neither the Butterworth filter nor the moving average, but peak detection instead.
Moving average: each output sample is the average of the absolute value of, e.g., the 50 previous input samples. The output will be attenuated.
Peak detection: each output sample is the maximum of the absolute value of, e.g., the 50 previous input samples. The output will not be attenuated. You can then low-pass filter the result to get rid of the remaining staircase ripple.
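The two definitions above can be compared side by side in a small sketch. The function names are hypothetical, and I am assuming that “the 50 previous input samples” means the 50 most recent ones including the current sample; the constant-amplitude 3 kHz tone is just a test input.

```python
import math
from collections import deque

def peak_envelope(signal, window=50):
    """Each output is the maximum of |x| over the `window` most recent samples."""
    recent = deque(maxlen=window)
    envelope = []
    for sample in signal:
        recent.append(abs(sample))
        envelope.append(max(recent))
    return envelope

def average_envelope(signal, window=50):
    """Each output is the mean of |x| over the `window` most recent samples."""
    recent = deque(maxlen=window)
    envelope = []
    for sample in signal:
        recent.append(abs(sample))
        envelope.append(sum(recent) / len(recent))
    return envelope

# Hypothetical test input: a 3 kHz tone with constant amplitude 1.
sample_rate = 44100
tone = [math.sin(2 * math.pi * 3000 * n / sample_rate) for n in range(2000)]

peak = peak_envelope(tone)
average = average_envelope(tone)
# Optional smoothing of the staircase left by the peak detector.
smoothed = average_envelope(peak)
```

Once the window has filled, the peak detector stays close to the true amplitude of 1, while the moving average settles at roughly 2/π of it. The window must span at least half a carrier period, otherwise the peak detector starts to follow the carrier itself.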
You may be wondering why, for example, the Butterworth filter attenuates your signal. It hardly does if your cutoff frequency is high enough; the signal just SEEMS to be strongly attenuated. Your input signal is not the sum of the carrier (whistle) and the modulation (envelope), but their product. Filtering limits the frequency content. What remains are frequency components (terms), not factors. You see an attenuated modulation (envelope) because that is the frequency component that is really present in your signal, rather than the original envelope, since the envelope was not added to your carrier but multiplied by it. Since the sinusoidal carrier by which your envelope is multiplied is not always at its maximum value, the envelope is “attenuated” by the modulation process, not by the filtering in your analysis.
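The size of this apparent attenuation can be worked out. As a sketch, assume the signal is rectified before low-pass filtering, and write it as a non-negative envelope e(t) times a sinusoidal carrier of angular frequency ω_c (these symbols are mine, not from your post):

```latex
\begin{align}
x(t) &= e(t)\,\sin(\omega_c t), \qquad e(t) \ge 0 \\
|x(t)| &= e(t)\,|\sin(\omega_c t)| \\
|\sin\theta| &= \frac{2}{\pi} - \frac{4}{\pi}\sum_{k=1}^{\infty}\frac{\cos(2k\theta)}{4k^2-1} \\
|x(t)| &= \frac{2}{\pi}\,e(t) \;-\; \frac{4}{\pi}\,e(t)\sum_{k=1}^{\infty}\frac{\cos(2k\omega_c t)}{4k^2-1}
\end{align}
```

If the envelope varies much more slowly than the carrier, a low-pass filter removes the terms around 2ω_c, 4ω_c, and so on, and leaves only the first term, (2/π) e(t) ≈ 0.64 e(t). The factor 2/π comes from the multiplication by the carrier, not from the filter.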
In short: if you want the (multiplicative) envelope directly, rather than the (additive) frequency component that results from the modulation (multiplication) by the envelope, use the peak detection approach.
Here is the peak detection algorithm in Pythonish pseudo-code, to give you the idea.