kestrell

Excerpted from https://mspoweruser.com/vall-e-copies-speakers-voices-to-synthesize-speech/

In an experiment detailed in a paper (Cornell University), VALL-E was tested and led to favorable results.

“Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity,” the paper reads. “In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”

In some of the samples shared, the synthesized speeches using acoustic prompts sound almost flawless. VALL-E managed to copy the same tones and emotions from the original speakers and even used them in delivering a very different personalized speech. For instance, it was able to produce recordings of the same sentence (“We have to reduce the number of plastic bags”) delivered in different moods or tones, such as anger, sleepiness, neutrality, amusement, and disgust.

Despite this exceptional performance, Microsoft probably has plans to improve VALL-E further. And while it can be useful in various scenarios, the technology can also be dangerous in the hands of the wrong individuals. Thankfully, it is currently unavailable to the public, which could give the Redmond company more time to think about how and where it will offer this technology.


Jan. 11th, 2023

Microsoft’s VALL-E copies original speakers’ voices, emotions to synthesize personalized speeches
