Microsoft and scientists at the University of Toronto (U of T) have developed new software capable of translating spoken messages into a different language in the original speaker’s voice.
Rick Rashid, Microsoft’s chief research officer, unveiled the technology during a presentation in China on Oct. 25.
The system first transcribes a person’s speech using voice recognition software, then translates the resulting text word by word into the desired language, Rashid explained.
A text-to-speech system then converts the translated text, which is reordered to make sense in the new language, into speech.
The system plays back the translated message in the voice of the original speaker, using samples from both the speaker and native speakers of the target language.
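The stages Rashid described (recognition, word-by-word translation, reordering, speech synthesis) can be pictured as a simple chain of steps. The sketch below is only a toy illustration, not Microsoft's system: the English-to-French lexicon, the adjective reordering rule, the `Utterance` type and every function name are assumptions made up for this example, and recognition and synthesis are reduced to stand-ins that work on text.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon (English -> French); not taken from the article.
LEXICON = {"the": "la", "red": "rouge", "car": "voiture"}
# Hypothetical rule data: translated adjectives that should follow their noun.
POSTPOSED_ADJECTIVES = {"rouge"}


@dataclass
class Utterance:
    text: str
    voice: str  # label standing in for the original speaker's voice samples


def recognize(transcript):
    """Stand-in for speech recognition: starts from a transcript, not audio."""
    return transcript.lower().split()


def translate_words(words):
    """Translate word by word, leaving unknown words unchanged."""
    return [LEXICON.get(word, word) for word in words]


def reorder(words):
    """Move each listed adjective after the word that follows it."""
    out = list(words)
    i = 0
    while i < len(out) - 1:
        if out[i] in POSTPOSED_ADJECTIVES:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return out


def synthesize(words, voice):
    """Stand-in for text-to-speech: tags the text with the speaker's voice."""
    return Utterance(text=" ".join(words), voice=voice)


def translate_speech(transcript, voice):
    """Run the four stages in the order described above."""
    return synthesize(reorder(translate_words(recognize(transcript))), voice)


result = translate_speech("The red car", voice="speaker-1")
print(result.text)   # la voiture rouge
print(result.voice)  # speaker-1
```

The word-by-word pass alone would give "la rouge voiture"; the reordering step is what turns it into natural French word order.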
The project was made possible by a breakthrough in speech recognition technology achieved by Microsoft Research and scientists at U of T over the last few years, Rashid said. He then demonstrated the process by having his own speech translated into Mandarin.
“The idea that they had was to use technology in a way that [was] patterned after [the way] the human brain works,” Rashid said.
This technique, called Deep Neural Networks, reduced speech recognition errors by approximately 30 per cent, he said.
“That’s the difference between going from 20 to 25 per cent errors, or about one error out of every four or five words, to roughly 15 per cent; roughly one out of every seven words or perhaps even one out of eight,” Rashid said.
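The figures in Rashid’s quote can be checked with a little arithmetic. The sketch below assumes the roughly 30 per cent figure describes a relative drop in word errors, which is a reading of the quote rather than something the article spells out; the function names are made up for this example.

```python
def relative_error_reduction(before, after):
    """Fraction of the original errors that were eliminated."""
    return (before - after) / before


def words_per_error(error_rate):
    """Average number of words spoken for each misrecognized word."""
    return 1 / error_rate


AFTER = 0.15  # "roughly 15 per cent" from the quote

for before in (0.20, 0.25):
    cut = relative_error_reduction(before, AFTER)
    print(f"{before:.0%} -> {AFTER:.0%}: {cut:.0%} fewer errors")

# One error in about every 6.7 words, in line with "one out of every seven".
print(f"{words_per_error(AFTER):.1f} words per error")
```

Under that assumption, going from 20 or 25 per cent errors down to 15 per cent is a relative reduction of 25 to 40 per cent, a range that contains the approximate 30 per cent figure.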
George Dahl, a graduate student at U of T, said he and his colleague Abdel-Rahman Mohamed were the first to apply deep learning techniques to acoustic modelling.
“Microsoft saw our results on a small speech benchmark and recruited us as interns,” Dahl said via email.
“During my internship, my collaborators at Microsoft and I showed that deep neural nets could dramatically improve accuracy on a large vocabulary speech recognition task,” he said. “Since then, Microsoft and others have continued to apply these techniques to other speech recognition products and benchmarks, with outstanding results.”
Rashid also attributed the software’s success to a major shift in speech recognition technology in the late 1970s, driven by work done at Carnegie Mellon University.
“The idea was to use a statistical modelling technique . . . to really be able to take a lot of data from many speakers and produce more robust statistical models of speech,” Rashid said.
“That was a huge improvement and over the last 30 years, speech recognition systems have become dramatically better than they used to be. They still make a lot of mistakes, but in limited domains it’s possible to do very successful speech interfaces.”
While Rashid noted that the technology is still far from perfect, such advancements point to considerable potential for the field in the years to come.
“There’s much work to be done in this area,” Rashid said.
“This technology is very promising and we hope in a few years that we will be able to break down the language barriers between people. Personally, I think this is going to lead to a better world.”