Jan. 11th, 2023

kestrell
Excerpted from https://mspoweruser.com/vall-e-copies-speakers-voices-to-synthesize-speech/

In an experiment detailed in a paper (Cornell University), VALL-E was tested and produced favorable results.

“Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity,” the paper reads. “In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”

In some of the samples shared, the speech synthesized from acoustic prompts sounds almost flawless. VALL-E managed to copy the tones and emotions of the original speakers and even used them to deliver very different, personalized speech. For instance, it was able to produce recordings of the same sentence (“We have to reduce the number of plastic bags”) delivered in different moods or tones, such as anger, sleepiness, neutrality, amusement, and disgust.

Despite this exceptional performance, Microsoft probably plans to keep improving VALL-E. And while it can be useful in various scenarios, the technology can also be dangerous in the hands of the wrong individuals. Thankfully, it is currently unavailable to the public, which could give the Redmond company more time to think about how and where it will offer this technology.
