Google's Cloud Text-to-Speech and Speech-to-Text APIs are getting a host of updates today that introduce support for more languages, make it easier to hear automatically generated voices on different speakers and promise better transcripts thanks to improved speaker-recognition tools, among other things.
With this update, the Cloud Text-to-Speech API is now also generally available.
Let's look at the details. For many developers, the highlight of the release is probably the launch of 17 new WaveNet-based voices in a number of new languages. WaveNet is Google's technology for using machine learning to generate text-to-speech audio files; the result is a more natural-sounding voice.
With this update, the Text-to-Speech API now supports 14 languages and variants.
If you want to try out the new voices, you can use Google's demo with your own text here.
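Beyond the demo, developers call the API directly. As a minimal sketch (the voice name `en-US-Wavenet-D` is one example; the helper below only assembles the JSON body you would POST to the API's `v1/text:synthesize` REST endpoint with your own credentials):

```python
# Sketch: building a Cloud Text-to-Speech synthesis request that selects
# a WaveNet voice. The dict mirrors the REST API's JSON body; sending it
# (with authentication) is left out to keep the example self-contained.

def build_tts_request(text, voice_name="en-US-Wavenet-D",
                      language_code="en-US"):
    """Assemble the JSON body for a text:synthesize request."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3"},
    }

request_body = build_tts_request("Hello from a WaveNet voice.")
```

Google's official client libraries wrap this same request shape, so the helper is mainly useful for seeing which fields a synthesis call needs.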
Another interesting new feature is the beta launch of audio profiles. The idea here is to optimize the audio file for the medium on which you'll play it. A phone's speaker is different from the soundbar under your TV, after all. With audio profiles, you can optimize the same sound for phone calls, headphones and speakers, for example.
On the Speech-to-Text side, Google is now making it easier for developers to transcribe samples with multiple speakers. Using machine learning, the service can now recognize the different speakers (though you still have to tell it how many speakers there are in a given sample); every word is then labeled with a speaker number. If you have a stereo file with two channels (maybe a call-center agent on the left and the angry customer who called in to complain on the right), Google can now also use those channels to distinguish between speakers.
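Because every word comes back tagged with a speaker number, turning the raw results into a readable transcript is a simple grouping step. A sketch, assuming the word-level results have been reduced to `(word, speaker_tag)` pairs:

```python
from itertools import groupby

# Sketch: turning diarized word-level results into labeled speaker turns.
# Consecutive words with the same speaker tag are joined into one line.

def format_transcript(words):
    """Group consecutive (word, speaker_tag) pairs into labeled turns."""
    lines = []
    for speaker, run in groupby(words, key=lambda w: w[1]):
        text = " ".join(word for word, _ in run)
        lines.append(f"Speaker {speaker}: {text}")
    return lines

# Example: a call-center agent (speaker 1) and a customer (speaker 2).
sample = [("how", 1), ("can", 1), ("I", 1), ("help", 1),
          ("my", 2), ("order", 2), ("is", 2), ("late", 2)]
# format_transcript(sample) ->
#   ["Speaker 1: how can I help", "Speaker 2: my order is late"]
```

The grouping is purely sequential, which matches how diarized words arrive: a new speaker tag starts a new turn.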
Also new is support for multiple languages. This is something the Google Search app already supports, and the company is now making it available to developers, too. Developers can select up to four languages, and the Speech-to-Text API will automatically determine which one is being spoken.
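In a request, this amounts to a primary language plus a list of alternatives. A sketch, assuming the REST field names `languageCode` and `alternativeLanguageCodes` (one primary plus up to three alternatives gives the four-language limit):

```python
# Sketch: a Speech-to-Text recognition config that lists alternative
# languages, letting the API decide which of them is actually spoken.

def multilang_config(primary, alternatives):
    """Build a recognition config for multi-language detection."""
    if len(alternatives) > 3:
        raise ValueError("at most four languages total may be listed")
    return {
        "languageCode": primary,
        "alternativeLanguageCodes": list(alternatives),
    }

config = multilang_config("en-US", ["es-ES", "fr-FR"])
```

The transcription result then indicates which of the listed languages was detected for each utterance.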
Finally, the API now also returns confidence scores at the word level. That may sound like a minor thing (and the API already returned scores for each segment of speech), but Google notes that developers can now use this to build apps that focus on particular words. "For example, if a user inputs 'please set up a meeting with John for tomorrow at 2PM' into your app, you can decide to prompt the user to repeat 'John' or '2PM' if either has low confidence, but not to re-prompt for 'please,' even if it has low confidence, since it's not critical to the sentence," the team explains.
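That re-prompting logic is a few lines once the per-word scores are in hand. A sketch, assuming the results have been reduced to `(word, confidence)` pairs; the stop-word list and threshold are app-specific choices, not part of the API:

```python
# Sketch: using word-level confidence scores to decide which words to
# ask the user to repeat. Filler words are ignored even at low confidence,
# mirroring the "please" example from Google's announcement.

STOPWORDS = {"please", "a", "the", "for", "at", "with"}

def words_to_reprompt(words, threshold=0.7):
    """Return low-confidence words that matter to the sentence."""
    return [w for w, conf in words
            if conf < threshold and w.lower() not in STOPWORDS]

sample = [("please", 0.40), ("set", 0.95), ("up", 0.93), ("a", 0.99),
          ("meeting", 0.97), ("with", 0.96), ("John", 0.55),
          ("tomorrow", 0.92), ("at", 0.98), ("2PM", 0.60)]
# words_to_reprompt(sample) -> ["John", "2PM"]
# "please" scores low too, but is skipped as non-critical.
```

In a real app the critical/non-critical split would come from the app's own grammar (names, times, amounts) rather than a fixed stop-word list.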