Festival 2.4: why do some voices not work with the singing mode?

voice_kal_diphone and voice_ral_diphone work correctly in singing mode: there is voice output, and the pitch values match the indicated notes.

voice_cmu_us_ahw_cg and the other CMU voices do not work correctly: there is voice output, but the pitch does not change according to the indicated notes.

Is it possible to get the right result with the higher-quality CMU voices?

Command line for the working output (pitch is respected):

    text2wave -mode singing -eval "(voice_kal_diphone)" -o song.wav song.xml

Command line for the non-working output (no pitch changes):

    text2wave -mode singing -eval "(voice_cmu_us_ahw_cg)" -o song.wav song.xml

Here's song.xml:

    <?xml version="1.0"?>
    <!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN"
     "Singing.v0_1.dtd" []>
    <SINGING BPM="60">
    <PITCH NOTE="A4,C4,C4"><DURATION BEATS="0.3,0.3,0.3">nationwide</DURATION></PITCH>
    <PITCH NOTE="C4"><DURATION BEATS="0.3">is</DURATION></PITCH>
    <PITCH NOTE="D4"><DURATION BEATS="0.3">on</DURATION></PITCH>
    <PITCH NOTE="F4"><DURATION BEATS="0.3">your</DURATION></PITCH>
    <PITCH NOTE="F4"><DURATION BEATS="0.3">side</DURATION></PITCH>
    </SINGING>
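(In case it helps reproduce the problem with other lyrics: the markup above can be generated programmatically. The following is a hypothetical helper script, not part of Festival; it simply emits the same SINGING XML structure from a list of (word, notes, beats) tuples.)

```python
# Hypothetical helper (not part of Festival): build a song.xml in the
# SINGING markup used above from (word, notes, beats) tuples.
def make_singing_xml(words, bpm=60):
    header = ('<?xml version="1.0"?>\n'
              '<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN"\n'
              ' "Singing.v0_1.dtd" []>\n')
    lines = ['<SINGING BPM="%d">' % bpm]
    for word, notes, beats in words:
        lines.append(
            '<PITCH NOTE="%s"><DURATION BEATS="%s">%s</DURATION></PITCH>'
            % (",".join(notes), ",".join(str(b) for b in beats), word))
    lines.append('</SINGING>')
    return header + "\n".join(lines)

song = make_singing_xml([
    ("nationwide", ["A4", "C4", "C4"], [0.3, 0.3, 0.3]),
    ("is", ["C4"], [0.3]),
    ("on", ["D4"], [0.3]),
    ("your", ["F4"], [0.3]),
    ("side", ["F4"], [0.3]),
])
print(song)
```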

You may also need this patch for singing-mode.scm :

    @@ -339,7 +339,9 @@
     (defvar singing-max-short-vowel-length 0.11)

     (define (singing_do_initial utt token)
    -  (if (equal? (item.name token) "")
    +  (if (and
    +       (not (equal? nil token))
    +       (equal? (item.name token) ""))
         (let ((restlen (car (item.feat token 'rest))))
           (if singing-debug
               (format t "restlen %l\n" restlen))

To set up the environment, I used the festvox fest_build script. You can also download voice_cmu_us_ahw_cg separately.

1 answer

It seems the problem is in how the phones are generated.

voice_kal_diphone uses the UniSyn synthesizer, while voice_cmu_us_ahw_cg uses the ClusterGen model. The latter has its own (state-based) intonation and duration model instead of per-phone intonation and duration targets: you may have noticed that the durations did not change in the generated "song" either.

singing-mode.scm tries to take each syllable and change its frequency. With a ClusterGen voice, the waveform generator simply ignores the frequencies and durations set on the syllable targets, because it models them differently.
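For context on what those frequency targets are: the note names in the SINGING markup (e.g. "A4") map to fundamental frequencies in Hz. A minimal sketch of that conversion, assuming equal temperament with A4 = 440 Hz (a hypothetical illustration, not Festival's actual code):

```python
# Semitone offsets of the natural notes within an octave.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def note_to_hz(note):
    """Convert a note name like 'A4' or 'C#4' to a frequency in Hz."""
    name, octave = note[:-1], int(note[-1])
    semitone = NOTE_OFFSETS[name[0]] + ("#" in name) - ("b" in name)
    midi = 12 * (octave + 1) + semitone   # MIDI note number (A4 = 69)
    return 440.0 * 2 ** ((midi - 69) / 12)

print(round(note_to_hz("A4")))  # 440
print(round(note_to_hz("C4")))  # 262
```

With UniSyn these per-syllable F0 targets are honored at waveform generation time; ClusterGen instead predicts F0 from its own statistical model.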

As a result, we get better voice quality (thanks to the statistical model), but we cannot directly change the pitch.

A very good description of the generation pipeline can be found here .


Source: https://habr.com/ru/post/1237235/
