If you want to use Rubber Band to change the duration and pitch, then I think the difficult part will be mapping the phonemes/syllables in the text to the corresponding audio ranges in the synthesized output, and I don't have a simple suggestion for that. (Ideally, you would hook into the speech synthesizer so that it gives you a mapping from phonemes to audio positions.)
The simplest alternative would be to use a speech synthesizer that supports the SSML markup language. Its "prosody" element has "pitch" and "duration" attributes, which can specify an absolute pitch in Hz and a duration in seconds or milliseconds. It also has a "volume" attribute, which you can use to control the dynamics.
Given that, you could try converting the text into an SSML document, marking up the words/syllables/phonemes with pitch, duration and volume attributes.
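As a rough sketch of that conversion, something like the following might work. It assumes you already have the text split into units with a target pitch, duration and volume for each one; the function name notes_to_ssml and the (text, pitch_hz, duration_s, volume) tuple format are made up for this illustration and are not part of any library. It only builds the SSML string. How well a given synthesizer honours the prosody attributes (duration in particular) varies between engines, so treat the output as a starting point.

```python
import xml.etree.ElementTree as ET


def notes_to_ssml(notes, lang="en-US"):
    """Build an SSML document from (text, pitch_hz, duration_s, volume) tuples.

    The tuple layout is a hypothetical input format chosen for this sketch.
    volume is an SSML volume label such as "soft", "medium" or "loud".
    """
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": lang,
    })
    for text, pitch_hz, duration_s, volume in notes:
        # One <prosody> element per unit: absolute pitch in Hz,
        # duration in whole milliseconds, volume as a label.
        prosody = ET.SubElement(speak, "prosody", {
            "pitch": f"{pitch_hz:g}Hz",
            "duration": f"{round(duration_s * 1000)}ms",
            "volume": volume,
        })
        prosody.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    # Three syllables on A4, C5 and E5 (example values only).
    melody = [
        ("la", 440.0, 0.5, "loud"),
        ("la", 523.25, 0.25, "medium"),
        ("la", 659.25, 1.0, "soft"),
    ]
    print(notes_to_ssml(melody))
```

The resulting string can then be passed to whichever SSML-capable synthesizer you use. If you need phoneme-level control, the same idea applies, but you would have to wrap each unit in whatever phoneme markup your engine accepts.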