IN BRIEF: Researchers have found that the speech recognition software Deep Speech 2 is significantly faster and more accurate at producing text on a mobile device than humans are at typing on a keyboard.
Researchers from Stanford University and the University of Washington ran a study on a new program called Deep Speech 2 (developed by Chinese internet giant Baidu) and found that the speech recognition software is faster and more accurate at producing text than human typists.
Baidu’s Deep Speech 2 is cloud-based speech recognition software built on a deep learning neural network. Essentially, the software trains itself by analyzing massive datasets of real speech.
“Speech recognition is something that’s been promised to us for decades, but it has never worked very well,” said James Landay, a professor of computer science at Stanford and co-author of the study.
“But we were noticing that in the past two to three years, speech recognition was actually improving a lot, benefiting from big data and deep learning to train its neural networks to produce faster, more accurate results. So we decided to formally test it against humans.”
To test the software, the team pitted Deep Speech 2 against 32 people between the ages of 19 and 32. The tests, which ran in both English and Mandarin Chinese, had the participants take turns saying, and then typing, short phrases into an iPhone—phrases like “physics and chemistry are hard,” “have a good weekend,” and “go out for some pizza and beer.” Half of the subjects typed using the QWERTY keyboard, while the other half conducted the test using iOS’s Pinyin keyboard.
In the end, machine triumphed over man. For English, the speech recognition software was three times faster with a 20.4 percent lower error rate than typing. For Mandarin Chinese, the software was 2.8 times faster with a 63.4 percent lower error rate compared to typing.
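Transcription accuracy in studies like this is typically scored as word error rate (WER): the word-level edit distance between what was produced and a reference phrase, divided by the reference length. A “20.4 percent lower error rate” is a relative reduction—for example, a hypothetical WER of 5.0 percent dropping to about 3.98 percent. Below is a minimal sketch of a WER calculation; the function name and example phrases are illustrative, not taken from the study:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of seven: WER = 1/7 ≈ 0.143
print(word_error_rate("go out for some pizza and beer",
                      "go out for some pizza in beer"))
```

In practice a 20.4 percent relative reduction means the speech system’s WER was roughly four-fifths of the typists’ WER on the same phrases.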
Researchers hope that this breakthrough will encourage engineers to design interfaces that will take better advantage of voice recognition technology.
“Imagine an interface where you use speech to start and then it switches to a graphical interface that you can touch and control with your finger,” Landay said.