AWS Polly in Java
In this tutorial, we will discuss Amazon Polly in detail.
What is Amazon Polly?
Amazon Polly is a cloud service by Amazon Web Services, a subsidiary of Amazon.com, that converts text into lifelike speech. It lets us create applications that talk and build entirely new categories of speech-enabled products. Amazon Polly supports multiple languages and includes a variety of lifelike voices, so we can build speech-enabled applications that work in many locations and use the ideal voice for our customers.
With Amazon Polly, we pay only for the text we synthesize, and we can cache and replay Amazon Polly-generated speech at no additional cost. Amazon Polly includes several Neural Text-to-Speech (NTTS) voices, which use a machine learning approach to deliver markedly better speech quality and offer natural, human-like text-to-speech voices. NTTS technology also supports a Newscaster speaking style tailored to news narration use cases. So, this was an overview of what Amazon Polly is.
Now let us discuss a few of the benefits of Amazon Polly.
Advantages of Amazon Polly
How Does Amazon Polly Work?
Amazon Polly converts input text into lifelike speech. We call one of the speech synthesis methods, provide the text we want to synthesize, choose one of the Neural Text-to-Speech (NTTS) or standard Text-to-Speech (TTS) voices, and specify an audio output format.
Amazon Polly then synthesizes the provided text into a high-quality speech audio stream. The first thing we supply is the input text: we provide the text we want to synthesize, and Amazon Polly returns an audio stream. We can provide the input as plain text or in Speech Synthesis Markup Language (SSML) format. With SSML, we can control various aspects of speech, such as pronunciation, volume, pitch and speech rate.
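As a rough illustration of SSML input, the sketch below builds an SSML string that sets speech rate, pitch and volume with a prosody element. The class and method names (SsmlSketch, escapeXml, withProsody) are made up for this example, and the code only assembles the string with the JDK; it does not call Amazon Polly. Not every attribute is supported by every voice engine, so the values used here are assumptions to check against the Polly SSML documentation.

```java
// Sketch only: builds an SSML string with the JDK alone; no AWS call is made.
final class SsmlSketch {

    // Escape the five XML-reserved characters so plain text is safe inside SSML.
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Wrap the text in <speak> and a <prosody> element that sets rate, pitch and volume.
    static String withProsody(String text, String rate, String pitch, String volume) {
        return "<speak><prosody rate=\"" + rate + "\" pitch=\"" + pitch
                + "\" volume=\"" + volume + "\">" + escapeXml(text) + "</prosody></speak>";
    }

    public static void main(String[] args) {
        // Prints: <speak><prosody rate="slow" pitch="+5%" volume="loud">Tom &amp; Jerry</prosody></speak>
        System.out.println(withProsody("Tom & Jerry", "slow", "+5%", "loud"));
    }
}
```

The resulting string would be sent as the request text with the text type set to SSML instead of plain text.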
Next are the available voices. Amazon Polly provides a portfolio of languages and a variety of voices, including a bilingual voice for both English and Hindi. For most languages, we can choose from several voices, both male and female. When launching a speech synthesis task, we specify the voice ID, and Amazon Polly uses this voice to convert the text to speech. Amazon Polly is not a translation service; the synthesized speech is in the same language as the text. However, numbers written as digits (for example, 53 rather than fifty-three) are synthesized in the language of the voice, not the language of the text.
The last one is the output format. Amazon Polly can deliver synthesized speech in multiple formats, and we can select the audio format that suits our needs. For example, we might request the speech in MP3 format for consumption by web and mobile applications, or ask for the PCM output format for consumption by AWS IoT devices and telephony solutions.
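The three inputs described above (text, voice and output format) travel together in a single synthesis request. The sketch below assembles a JSON body using the field names of the SynthesizeSpeech API (Engine, OutputFormat, Text, TextType, VoiceId). The class SpeechRequestSketch and its methods are hypothetical, and the voice "Joanna" is only an example value. In a real application, the AWS SDK for Java would build, sign and send the request for us, so this is only a way to see what a request contains.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: assembles the JSON body of a speech synthesis request.
// Nothing is sent anywhere; the AWS SDK normally does this work for us.
final class SpeechRequestSketch {

    static String toJson(String text, String textType, String voiceId,
                         String engine, String outputFormat) {
        // LinkedHashMap keeps the fields in a predictable order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Engine", engine);             // "standard" or "neural"
        fields.put("OutputFormat", outputFormat); // for example "mp3" or "pcm"
        fields.put("Text", text);
        fields.put("TextType", textType);         // "text" or "ssml"
        fields.put("VoiceId", voiceId);

        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (json.length() > 1) {
                json.append(',');
            }
            json.append('"').append(field.getKey()).append("\":\"")
                .append(escapeJson(field.getValue())).append('"');
        }
        return json.append('}').toString();
    }

    // Minimal escaping for this sketch: backslash, double quote and newline only.
    static String escapeJson(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // MP3 for a web or mobile client; "pcm" could be passed for telephony instead.
        System.out.println(toJson("Hello from Polly", "text", "Joanna", "neural", "mp3"));
    }
}
```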
Next, let us look at three common use cases of Amazon Polly:
The first use case is content creation. Audio can be a complementary medium to written or visual communication. By voicing our content, we can provide our audience with an alternative way to consume information and meet the needs of a larger pool of readers. Amazon Polly can generate speech in dozens of languages, making it easy to add speech to applications with a global audience, such as RSS feeds, websites or videos.
For example, if we have written a blog post on WordPress, we can pass its text to Amazon Polly to convert it into an audio file. Once the audio file is stored, we can copy it to our mobile devices and listen to it whenever we have free time. In this way, we can build up a lot of content in audio format. This is also the basic idea behind podcasts.
Amazon Polly enables developers to provide their applications with an enhanced visual experience, such as speech-synchronized facial animation. Amazon Polly makes it easy to request an additional metadata stream with information about when particular sentences, words and sounds are pronounced. Using this metadata stream alongside the synthesized speech audio stream, customers can animate avatars and highlight text as it is currently spoken in their application.
An example is playing speech while highlighting the spoken text, which is an excellent use of Amazon Polly in the e-learning field.
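The highlighting idea can be sketched as follows. Amazon Polly returns this metadata (speech marks) as one JSON object per line, with a timestamp in milliseconds and the start and end offsets of each word in the input text. The class SpeechMarkSketch, its methods and the sample timings below are illustrative assumptions. The regular expression only handles the flat word-mark lines shown in the sample, so a real application should use a proper JSON library. The offsets are byte offsets, so this sketch also assumes ASCII input text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: finds the word being spoken at a given playback time and marks it in the text.
final class SpeechMarkSketch {

    static final class Mark {
        final int timeMs;
        final int start;
        final int end;

        Mark(int timeMs, int start, int end) {
            this.timeMs = timeMs;
            this.start = start;
            this.end = end;
        }
    }

    // Matches only flat word marks with this exact field order (an assumption of the sketch).
    private static final Pattern WORD_MARK = Pattern.compile(
            "\\{\"time\":(\\d+),\"type\":\"word\",\"start\":(\\d+),\"end\":(\\d+),\"value\":\"[^\"]*\"\\}");

    static List<Mark> parseWordMarks(String jsonLines) {
        List<Mark> marks = new ArrayList<>();
        for (String line : jsonLines.split("\n")) {
            Matcher m = WORD_MARK.matcher(line.trim());
            if (m.matches()) {
                marks.add(new Mark(Integer.parseInt(m.group(1)),
                        Integer.parseInt(m.group(2)), Integer.parseInt(m.group(3))));
            }
        }
        return marks;
    }

    // Returns the text with the currently spoken word in brackets, a stand-in for UI highlighting.
    static String highlightAt(String text, String jsonLines, int playbackMs) {
        Mark active = null;
        for (Mark mark : parseWordMarks(jsonLines)) {
            if (mark.timeMs <= playbackMs) {
                active = mark; // marks are ordered by time, so the last match wins
            }
        }
        if (active == null) {
            return text; // playback has not reached the first word yet
        }
        return text.substring(0, active.start) + "["
                + text.substring(active.start, active.end) + "]" + text.substring(active.end);
    }

    public static void main(String[] args) {
        String marks = "{\"time\":6,\"type\":\"word\",\"start\":0,\"end\":4,\"value\":\"Mary\"}\n"
                + "{\"time\":373,\"type\":\"word\",\"start\":5,\"end\":8,\"value\":\"had\"}";
        // Prints: Mary [had] a little lamb
        System.out.println(highlightAt("Mary had a little lamb", marks, 400));
    }
}
```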
With Amazon Polly, our contact centers can engage customers with natural-sounding voices. We can cache and replay Amazon Polly's speech output to prompt callers through interactive voice response (IVR) systems such as Amazon Connect. Additionally, we can use Amazon Polly's API to deliver automated real-time information such as service status, account and billing inquiries, addresses, and contact information.
An example here is text-to-speech for telephony systems. So, these were some of the use cases for Amazon Polly.
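The caching of prompts mentioned for contact centers can be sketched with a small in-memory cache. Everything here is an assumption made for illustration: the class SpeechCacheSketch, the cache key scheme, and the synthesizer function, which stands in for whatever code actually calls Amazon Polly. The point is only that a repeated prompt is synthesized once and then replayed from the cache.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch only: caches synthesized audio so that repeated prompts are not synthesized again.
final class SpeechCacheSketch {

    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final String voiceId;
    private final String outputFormat;
    private final Function<String, byte[]> synthesizer; // hypothetical stand-in for the Polly call

    SpeechCacheSketch(String voiceId, String outputFormat, Function<String, byte[]> synthesizer) {
        this.voiceId = voiceId;
        this.outputFormat = outputFormat;
        this.synthesizer = synthesizer;
    }

    // Returns cached audio when available; otherwise synthesizes once and stores the result.
    byte[] audioFor(String text) {
        String key = cacheKey(voiceId, outputFormat, text);
        return cache.computeIfAbsent(key, k -> synthesizer.apply(text));
    }

    // The key covers voice, format and text, so changing any of them yields a new entry.
    // A fixed-length hex digest could also serve as a file name for an on-disk cache.
    static String cacheKey(String voiceId, String outputFormat, String text) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(
                    (voiceId + "\n" + outputFormat + "\n" + text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The lambda fakes synthesis by returning the text bytes and counting invocations.
        SpeechCacheSketch prompts = new SpeechCacheSketch("Joanna", "mp3", text -> {
            calls[0]++;
            return text.getBytes(StandardCharsets.UTF_8);
        });
        prompts.audioFor("Welcome to our support line.");
        prompts.audioFor("Welcome to our support line.");
        System.out.println("synthesizer calls: " + calls[0]); // the second request is served from the cache
    }
}
```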
Speech Synthesis Markup Language (SSML)
Amazon Polly can generate speech from either plain text or documents marked up with SSML. With SSML, we can control aspects of speech such as pronunciation, volume, pitch and speech rate.
Let us walk through a short demo of how Amazon Polly works. The demo is based on text-to-speech conversion. The concept of text-to-speech software is simple: we take a paragraph, a page, an article, or even a whole book, and have a computer read it aloud. When people think about text-to-speech, they often associate it with robotic voices and stilted cadences. However, this is usually not the case, particularly with modern software. Text-to-speech may sound like a gimmick to some people, but it is a technology with practical applications.
Let us discuss some advantages of text-to-speech software.