AWS Polly in Java

In this tutorial, we will discuss Amazon Polly in detail.

What is Amazon Polly?

Amazon Polly is a cloud service from Amazon Web Services (AWS), a subsidiary of Amazon.com, that converts text into lifelike speech. It lets us create applications that talk and build entirely new categories of speech-enabled products. Amazon Polly supports multiple languages and includes a variety of lifelike voices, so we can build speech-enabled applications that work in many locations and choose the ideal voice for our customers.

With Amazon Polly, we pay only for the text we synthesize, and we can cache and replay Amazon Polly-generated speech at no additional cost. Amazon Polly also includes several neural text-to-speech (NTTS) voices, which use a newer machine learning approach to deliver markedly better speech quality and offer the most natural, human-like text-to-speech voices possible. NTTS technology also supports a Newscaster speaking style tailored to news narration use cases. That is an overview of what Amazon Polly is.

Now let us discuss a few of the benefits of Amazon Polly.

Advantages of Amazon Polly

  1. The first advantage is speech quality. Amazon Polly offers new neural TTS and best-in-class standard TTS technology to synthesize natural speech with high pronunciation accuracy, including abbreviations, acronym expansions, date/time interpretations, and homograph disambiguation.
  2. The second advantage is low latency. Amazon Polly returns responses quickly, which makes it a viable option for low-latency use cases such as dialogue systems.
  3. The third advantage is support for an extensive portfolio of languages and voices. Amazon Polly supports dozens of voices and languages, offering male and female voice options for most languages. At the time of writing, NTTS supports three British English voices and eight US English voices, and this number will continue to grow as more neural voices come online.
  4. The next advantage is cost-effectiveness. Amazon Polly's pay-per-use model means there are no setup costs, and we can start small and scale up as our application grows.
  5. The next benefit is that it is a cloud-based solution. On-device TTS solutions require significant computing resources, including CPU power, RAM and disk space. These can result in higher development costs and higher power consumption on devices such as tablets and smartphones. In contrast, TTS conversion in the AWS cloud dramatically reduces the local resource requirements, enabling support for all the available languages and voices at the best possible quality. Moreover, speech improvements are instantly available to all users and do not require additional device updates.

How Does Amazon Polly Work?

Amazon Polly converts input text into lifelike speech. We call one of the speech synthesis methods, provide the text we want to synthesize, choose one of the neural text-to-speech (NTTS) or standard text-to-speech (TTS) voices, and specify an audio output format.

Amazon Polly then synthesizes the provided text into a high-quality speech audio stream. The first thing we provide is the input text. We can supply it as plain text or in Speech Synthesis Markup Language (SSML) format. With SSML, we can control various aspects of speech, such as pronunciation, volume, pitch and speech rate.

The next is the available voices. Amazon Polly provides a portfolio of languages and a variety of voices, including a bilingual voice for both English and Hindi. For most languages, we can choose from several voices, both male and female. When launching a speech synthesis task, we specify the voice ID, and Amazon Polly uses this voice to convert the text to speech. Amazon Polly is not a translation service; the synthesized speech is in the same language as the text. However, if the text is in a different language from the one designated for the voice, numbers represented as digits (for example, 53 rather than fifty-three) are synthesized in the language of the voice, not the text.

The last one is the output format. Amazon Polly can deliver synthesized speech in multiple formats, and we can select the audio format that suits our needs. For example, we might request the speech in MP3 format for consumption by web and mobile applications, or ask for the PCM output format for consumption by AWS IoT devices and telephony solutions.
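The three inputs described above (text, voice and output format) can be modeled in plain Java before any AWS call is made. The following is only a minimal sketch: the class `SynthesisRequestSketch` and its methods are hypothetical names invented for illustration and are not part of the AWS SDK, and the voice ID "Joanna" is used simply as an example of a US English voice.

```java
import java.util.Locale;

/** Hypothetical value object for illustration; it is not part of the AWS SDK. */
final class SynthesisRequestSketch {

    /** Whether the input is plain text or SSML. */
    enum TextType { TEXT, SSML }

    /** Output formats; JSON is the one used for speech marks rather than audio. */
    enum OutputFormat { MP3, OGG_VORBIS, PCM, JSON }

    final String text;
    final TextType textType;
    final String voiceId;
    final OutputFormat outputFormat;

    SynthesisRequestSketch(String text, TextType textType,
                           String voiceId, OutputFormat outputFormat) {
        if (text == null || text.trim().isEmpty()) {
            throw new IllegalArgumentException("text must not be empty");
        }
        if (voiceId == null || voiceId.trim().isEmpty()) {
            throw new IllegalArgumentException("voiceId must not be empty");
        }
        if (textType == null || outputFormat == null) {
            throw new IllegalArgumentException("textType and outputFormat are required");
        }
        // SSML input is wrapped in a <speak> root element.
        if (textType == TextType.SSML && !text.trim().startsWith("<speak")) {
            throw new IllegalArgumentException("SSML input must start with <speak>");
        }
        this.text = text;
        this.textType = textType;
        this.voiceId = voiceId;
        this.outputFormat = outputFormat;
    }

    /** Lower-case spelling of an enum constant, e.g. OGG_VORBIS becomes ogg_vorbis. */
    private static String apiValue(Enum<?> value) {
        return value.name().toLowerCase(Locale.ROOT);
    }

    /** One-line summary of the request, handy for logging. */
    String describe() {
        return "VoiceId=" + voiceId
                + ", TextType=" + apiValue(textType)
                + ", OutputFormat=" + apiValue(outputFormat);
    }

    public static void main(String[] args) {
        SynthesisRequestSketch request = new SynthesisRequestSketch(
                "Hello from Java", TextType.TEXT, "Joanna", OutputFormat.MP3);
        System.out.println(request.describe());
    }
}
```

Validating the request locally keeps obviously malformed input, such as SSML without a root element, from ever reaching the service. In a real application these values would be passed to the SDK's speech synthesis call.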

Next, we will discuss three common use cases of Amazon Polly:

  1. Content creation
  2. E-Learning
  3. Telephony

Content Creation

The first use case is content creation. Audio can be a complementary medium to written or visual communication. By voicing our content, we can provide our audience with an alternative way to consume information and meet the needs of a larger pool of readers. Amazon Polly can generate speech in dozens of languages, making it easy to add speech to applications with a global audience, such as RSS feeds, websites or videos.

For example, suppose we have written a blog post on WordPress. We can pass the text of the post to Amazon Polly, which converts it into an audio file. Once the audio file is stored, we can copy it to our mobile devices and listen to it whenever we have free time. In this way, a lot of our content becomes available in audio format. This is also the basic idea behind podcasts.

E-Learning

Amazon Polly enables developers to provide their applications with an enhanced visual experience, such as speech-synchronized facial animation. Amazon Polly makes it easy to request an additional metadata stream with information about when particular sentences, words and sounds are pronounced. Using this metadata stream alongside the synthesized speech audio stream, customers can animate avatars and highlight text as it is spoken in their application.

An example is playing speech while highlighting the spoken text, which is an excellent use of Amazon Polly in the e-learning field.


Telephony

With the help of Amazon Polly, our contact centers can engage customers with natural-sounding voices. We can cache and replay Amazon Polly's speech output to prompt callers through interactive voice response (IVR) systems such as Amazon Connect. Additionally, we can use Amazon Polly's API to deliver automated real-time information such as service status, account and billing inquiries, addresses, and contact information.

An example here is text-to-speech for telephony systems. These are some of the use cases for Amazon Polly.
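The idea of caching and replaying prompts, mentioned above for IVR systems, can be sketched as a small in-memory cache. This is a minimal sketch under stated assumptions: `PromptCacheSketch` and `cacheKey` are hypothetical names, and the `synthesizer` function stands in for whatever code actually calls Amazon Polly and returns the audio bytes. The stand-in used in `main` simply returns the text bytes, so no AWS call is made.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical in-memory cache for synthesized prompts; not part of the AWS SDK. */
final class PromptCacheSketch {

    private final Map<String, byte[]> audioByKey = new ConcurrentHashMap<>();
    private final String voiceId;
    // Assumption: maps prompt text to audio bytes, e.g. by calling Amazon Polly.
    private final Function<String, byte[]> synthesizer;

    PromptCacheSketch(String voiceId, Function<String, byte[]> synthesizer) {
        this.voiceId = voiceId;
        this.synthesizer = synthesizer;
    }

    /** Returns cached audio if present; otherwise synthesizes once and stores the result. */
    byte[] audioFor(String text) {
        return audioByKey.computeIfAbsent(cacheKey(voiceId, text),
                key -> synthesizer.apply(text));
    }

    /** The same voice and text always produce the same key (a SHA-256 hex string). */
    static String cacheKey(String voiceId, String text) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(
                    (voiceId + "\n" + text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in synthesizer: returns the text bytes instead of real audio.
        PromptCacheSketch cache = new PromptCacheSketch("Joanna",
                text -> text.getBytes(StandardCharsets.UTF_8));
        byte[] first = cache.audioFor("Welcome to our support line.");
        byte[] second = cache.audioFor("Welcome to our support line.");
        System.out.println("Served from cache: " + (first == second));
    }
}
```

Including the voice ID in the key keeps the same prompt spoken by two different voices from colliding. A production system would more likely store the audio in durable storage than in memory.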

Speech Marks

  1. Speech marks are metadata describing the speech we synthesize, such as where a sentence or word starts and ends in the audio stream.
  2. When we request speech marks for our text, Amazon Polly returns this metadata instead of synthesized speech.
  3. Using speech marks in conjunction with the synthesized speech audio stream can provide our applications with an enhanced visual experience.
  4. For example, combining the metadata with the audio stream from our text lets us synchronize speech with facial animation (lip-syncing) or highlight written words as they are spoken.
  5. Speech marks are available when using either neural or standard text-to-speech voices.
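To illustrate how speech marks can drive text highlighting, here is a minimal Java sketch. It assumes the speech marks arrive as one JSON object per line with the fields `time`, `type`, `start`, `end` and `value` in that order, and it treats `start` and `end` as character positions, which holds only for plain ASCII text. The class `SpeechMarkSketch` is a hypothetical name, the timing values in the sample are made up, and the regular expression is a shortcut; a real application would use a JSON library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for one speech mark; not part of the AWS SDK. */
final class SpeechMarkSketch {

    final int timeMs;   // offset from the start of the audio, in milliseconds
    final String type;  // e.g. "sentence" or "word"
    final int start;    // start offset of the item in the input text
    final int end;      // end offset of the item in the input text
    final String value;

    SpeechMarkSketch(int timeMs, String type, int start, int end, String value) {
        this.timeMs = timeMs;
        this.type = type;
        this.start = start;
        this.end = end;
        this.value = value;
    }

    // Assumed field order; lines that do not match are skipped.
    private static final Pattern LINE = Pattern.compile(
            "\\{\"time\":(\\d+),\"type\":\"(\\w+)\",\"start\":(\\d+),"
            + "\"end\":(\\d+),\"value\":\"(.*)\"\\}");

    /** Parses line-delimited speech marks into objects. */
    static List<SpeechMarkSketch> parse(String jsonLines) {
        List<SpeechMarkSketch> marks = new ArrayList<>();
        for (String line : jsonLines.split("\\R")) {
            Matcher m = LINE.matcher(line.trim());
            if (m.matches()) {
                marks.add(new SpeechMarkSketch(
                        Integer.parseInt(m.group(1)), m.group(2),
                        Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)),
                        m.group(5)));
            }
        }
        return marks;
    }

    /** Returns the word being spoken at the playback position, or null before the first word. */
    static SpeechMarkSketch wordAt(List<SpeechMarkSketch> marks, int playbackMs) {
        SpeechMarkSketch current = null;
        for (SpeechMarkSketch mark : marks) {
            if (!mark.type.equals("word")) {
                continue;
            }
            if (mark.timeMs <= playbackMs) {
                current = mark;
            } else {
                break; // marks are assumed to be ordered by time
            }
        }
        return current;
    }

    /** Wraps the marked span of the text in square brackets. */
    static String highlight(String text, SpeechMarkSketch mark) {
        return text.substring(0, mark.start)
                + "[" + text.substring(mark.start, mark.end) + "]"
                + text.substring(mark.end);
    }

    public static void main(String[] args) {
        String text = "Mary had a little lamb.";
        String sample =
                "{\"time\":0,\"type\":\"sentence\",\"start\":0,\"end\":23,\"value\":\"Mary had a little lamb.\"}\n"
                + "{\"time\":6,\"type\":\"word\",\"start\":0,\"end\":4,\"value\":\"Mary\"}\n"
                + "{\"time\":373,\"type\":\"word\",\"start\":5,\"end\":8,\"value\":\"had\"}\n";
        List<SpeechMarkSketch> marks = parse(sample);
        System.out.println(highlight(text, wordAt(marks, 400)));
    }
}
```

In an e-learning player, `wordAt` would be called with the current playback position of the audio, and the returned offsets would be used to style the word on screen.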

Speech Synthesis Markup Language (SSML)

Amazon Polly can generate speech from either plain text or documents marked up with SSML. We can achieve the following effects with SSML:

  1. Emphasizing specific words or phrases
  2. Including a long pause
  3. Changing the speech rate or pitch
  4. Using phonetic pronunciation
  5. Including breathing sounds
  6. Whispering
  7. Using the Newscaster speaking style
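A few of the effects listed above can be produced by building an SSML string in Java. The following is a minimal sketch: the class `SsmlSketch` and its methods are hypothetical helpers, not part of the AWS SDK, and it covers only emphasis, pauses, a slower speech rate and whispering. Not every tag is accepted by every voice type, so the tags should be checked against the Amazon Polly documentation for the chosen voice.

```java
/** Hypothetical fluent builder for a small subset of SSML; not part of the AWS SDK. */
final class SsmlSketch {

    private final StringBuilder body = new StringBuilder();

    /** Escapes characters that are reserved in XML so user text cannot break the markup. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Appends plain spoken text. */
    SsmlSketch say(String text) {
        body.append(escape(text));
        return this;
    }

    /** Emphasizes specific words or phrases. */
    SsmlSketch emphasize(String text) {
        body.append("<emphasis level=\"strong\">").append(escape(text)).append("</emphasis>");
        return this;
    }

    /** Inserts a pause of the given length in seconds. */
    SsmlSketch pause(int seconds) {
        body.append("<break time=\"").append(seconds).append("s\"/>");
        return this;
    }

    /** Speaks the text at a slower rate. */
    SsmlSketch slowly(String text) {
        body.append("<prosody rate=\"slow\">").append(escape(text)).append("</prosody>");
        return this;
    }

    /** Whispers the text using the Amazon-specific effect tag. */
    SsmlSketch whisper(String text) {
        body.append("<amazon:effect name=\"whispered\">")
                .append(escape(text)).append("</amazon:effect>");
        return this;
    }

    /** Wraps everything in the required <speak> root element. */
    String build() {
        return "<speak>" + body + "</speak>";
    }

    public static void main(String[] args) {
        String ssml = new SsmlSketch()
                .say("This is ")
                .emphasize("really")
                .say(" important.")
                .pause(1)
                .whisper("But keep it a secret.")
                .build();
        System.out.println(ssml);
    }
}
```

The resulting string would be sent as the input text with the text type set to SSML instead of plain text.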

Lexicons

  1. Pronunciation lexicons enable us to customize the pronunciation of words.
  2. They are specific to a particular AWS Region.
  3. We can use one or more of the lexicons from that Region when synthesizing text with the "SynthesizeSpeech" operation.

Example

  1. "g3t sm4rt" should be read as "get smart".
  2. An acronym can be read in its expanded form. For example, "EC2" should be read as "Elastic Compute Cloud".
  3. A person's name can be given its correct pronunciation.
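Lexicons are written in the W3C Pronunciation Lexicon Specification (PLS) XML format, in which each entry pairs a written form (grapheme) with the text to speak instead (alias). The sketch below generates such a document for the examples above. The class `LexiconSketch` and its method names are hypothetical, and the output is only a minimal lexicon; it would still need to be uploaded to Amazon Polly before it could be used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical generator for a minimal PLS lexicon; not part of the AWS SDK. */
final class LexiconSketch {

    /** Escapes characters that are reserved in XML. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * Builds a PLS document in which every key (the written form) is replaced
     * by its value (the spoken form) for the given language code, e.g. "en-US".
     */
    static String toPls(String languageCode, Map<String, String> aliases) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<lexicon version=\"1.0\"\n")
                .append("      xmlns=\"http://www.w3.org/2005/01/pronunciation-lexicon\"\n")
                .append("      alphabet=\"ipa\" xml:lang=\"")
                .append(escape(languageCode)).append("\">\n");
        for (Map.Entry<String, String> entry : aliases.entrySet()) {
            xml.append("  <lexeme><grapheme>").append(escape(entry.getKey()))
                    .append("</grapheme><alias>").append(escape(entry.getValue()))
                    .append("</alias></lexeme>\n");
        }
        xml.append("</lexicon>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the entries in insertion order.
        Map<String, String> aliases = new LinkedHashMap<>();
        aliases.put("g3t sm4rt", "get smart");
        aliases.put("EC2", "Elastic Compute Cloud");
        System.out.print(toPls("en-US", aliases));
    }
}
```

For names of people, a lexicon entry would typically use a phonetic transcription rather than an alias; that case is left out of this sketch.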

Demo

Let us walk through a short demo of how Amazon Polly works. The demo is based on text-to-speech conversion. The concept of text-to-speech software is simple: we take a paragraph, a page, an article, or even a whole book, and have a computer read it aloud. When people think about text-to-speech, they often associate it with robotic voices and stilted cadences. However, this is usually not the case, particularly with modern software. Text-to-speech may sound like a gimmick to some people, but it is a technology with practical applications.

Let us discuss some advantages of text-to-speech software.

  1. It enables people with disabilities to read. The most apparent use of text-to-speech software is allowing people with visual impairments to consume written content.
  2. It provides a hands-free reading experience. Even if our eyesight is perfect, sometimes it is more comfortable or convenient to listen to something instead of reading it.
  3. It helps in situations where audio versions of content are unavailable. Most popular books are released in audio format these days. However, the same does not hold for most other written content, including articles, poems and more. Text-to-speech software enables us to listen to any written content we want, as long as the functionality is built in.




