Festival 2.4: why do some voices not work with the singing mode?

voice_kal_diphone and voice_ral_diphone work correctly in singing mode: there is voice output, and the pitch values match the indicated notes.

voice_cmu_us_ahw_cg and the other CMU voices do not work correctly: there is voice output, but the pitch does not change according to the indicated notes.

Is it possible to get the right result with the higher-quality CMU voices?

Command line for the working output (pitch is respected):

    text2wave -mode singing -eval "(voice_kal_diphone)" -o song.wav song.xml

Command line for the non-working output (no pitch changes):

    text2wave -mode singing -eval "(voice_cmu_us_ahw_cg)" -o song.wav song.xml

Here's song.xml:

    <?xml version="1.0"?>
    <!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN"
     "Singing.v0_1.dtd" []>
    <SINGING BPM="60">
    <PITCH NOTE="A4,C4,C4"><DURATION BEATS="0.3,0.3,0.3">nationwide</DURATION></PITCH>
    <PITCH NOTE="C4"><DURATION BEATS="0.3">is</DURATION></PITCH>
    <PITCH NOTE="D4"><DURATION BEATS="0.3">on</DURATION></PITCH>
    <PITCH NOTE="F4"><DURATION BEATS="0.3">your</DURATION></PITCH>
    <PITCH NOTE="F4"><DURATION BEATS="0.3">side</DURATION></PITCH>
    </SINGING>
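(In case it helps reproduce the problem with other lyrics: the markup above can be generated programmatically. The following is a hypothetical helper script, not part of Festival; it simply emits the same SINGING XML structure from a list of (word, notes, beats) tuples.)

```python
# Hypothetical helper (not part of Festival): build a song.xml in the
# SINGING markup used above from (word, notes, beats) tuples.
def make_singing_xml(words, bpm=60):
    header = ('<?xml version="1.0"?>\n'
              '<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN"\n'
              ' "Singing.v0_1.dtd" []>\n')
    lines = ['<SINGING BPM="%d">' % bpm]
    for word, notes, beats in words:
        lines.append(
            '<PITCH NOTE="%s"><DURATION BEATS="%s">%s</DURATION></PITCH>'
            % (",".join(notes), ",".join(str(b) for b in beats), word))
    lines.append('</SINGING>')
    return header + "\n".join(lines)

song = make_singing_xml([
    ("nationwide", ["A4", "C4", "C4"], [0.3, 0.3, 0.3]),
    ("is", ["C4"], [0.3]),
    ("on", ["D4"], [0.3]),
    ("your", ["F4"], [0.3]),
    ("side", ["F4"], [0.3]),
])
print(song)
```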

You may also need this patch for singing-mode.scm :

    @@ -339,7 +339,9 @@
     (defvar singing-max-short-vowel-length 0.11)

     (define (singing_do_initial utt token)
    -  (if (equal? (item.name token) "")
    +  (if (and
    +       (not (equal? nil token))
    +       (equal? (item.name token) ""))
         (let ((restlen (car (item.feat token 'rest))))
           (if singing-debug
               (format t "restlen %l\n" restlen))

To set up the environment, I used the festvox fest_build script. You can also download voice_cmu_us_ahw_cg separately.

1 answer

It seems the problem is in how the phones are generated.

voice_kal_diphone uses the UniSyn synthesizer, while voice_cmu_us_ahw_cg uses the ClusterGen model. The latter has its own (state-based) intonation and duration model instead of per-phone intonation and duration targets: you may have noticed that the durations did not change in the generated "song" either.

singing-mode.scm tries to take each syllable and change its frequency. With a ClusterGen voice, the waveform generator simply ignores the frequencies and durations set on the syllable targets, because it models them differently.
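For context on what those frequency targets are: the note names in the SINGING markup (e.g. "A4") map to fundamental frequencies in Hz. A minimal sketch of that conversion, assuming equal temperament with A4 = 440 Hz (a hypothetical illustration, not Festival's actual code):

```python
# Semitone offsets of the natural notes within an octave.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def note_to_hz(note):
    """Convert a note name like 'A4' or 'C#4' to a frequency in Hz."""
    name, octave = note[:-1], int(note[-1])
    semitone = NOTE_OFFSETS[name[0]] + ("#" in name) - ("b" in name)
    midi = 12 * (octave + 1) + semitone   # MIDI note number (A4 = 69)
    return 440.0 * 2 ** ((midi - 69) / 12)

print(round(note_to_hz("A4")))  # 440
print(round(note_to_hz("C4")))  # 262
```

With UniSyn these per-syllable F0 targets are honored at waveform generation time; ClusterGen instead predicts F0 from its own statistical model.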

As a result, we get better voice quality (thanks to the statistical model), but we cannot directly change the pitch.

A very good description of the generation pipeline can be found here .


Source: https://habr.com/ru/post/1237235/
