
Clever new speech recognition system from MIT learns language just like a newborn child

Speech-recognition systems may not yet be perfect, but as the likes of Amazon Echo show, they’re getting both better and more ubiquitous all the time.

New research from the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Laboratory (CSAIL) suggests a different technique for training these systems: getting them to learn by looking at images.


“This is an attempt to get machines to require less supervised training to learn about spoken language,” Jim Glass, a senior research scientist at CSAIL, told Digital Trends. “The conventional way to train speech recognition systems is by using recordings of people talking and, for each utterance, transcribing exactly what words have been said. Ideally, you have hundreds or thousands of hours of speech in order for the system to work properly. Some of the biggest companies doing this — like Baidu and Google — are using tens of thousands of hours for training. The more annotated data that they have, the better these systems perform.”

So what’s wrong with that? After all, as noted, speech-recognition tech is continuously getting better. Whatever computer scientists are doing is obviously working.

That may be true, but this new approach is interesting for a couple of reasons. First, letting a machine teach itself to understand speech from paired images and audio (eventually, you could imagine it training by watching YouTube) is much closer to the way that we learn as human beings.

Second, and arguably more important, it could help bring speech recognition to parts of the world that might greatly benefit from this kind of technology.

“Annotated data is expensive to produce,” Glass continued. “Speech recognition has been going on for decades and the majority of it has been for languages in countries which can afford to invest in these kind of resources. When it comes to language, it tends to be those which companies think will help them make a profit. English has received by far the most attention, followed by western European languages, and other languages like Japanese and Mandarin. The problem is that there are around 7,000 languages spoken in the world and around 300 that are spoken by more than 1 million people. A lot of these just haven’t received much attention — if any.”

In parts of the world where literacy levels are low, it’s easy to see how speech recognition could be a game changer in terms of providing people with access to information. Hopefully, this technology can help toward that goal.

As exciting as the research is, however, Glass notes it is still in its very early stages. So far, CSAIL researchers have fed their system a database of 1,000 images, each paired with a free-form spoken description that relates to it in some way. They then test the system by giving it a recording and asking it to retrieve the 10 images that best match what it hears.
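
For readers curious how a retrieval test like this could be scored, here is a minimal Python sketch. It is not CSAIL's code. It assumes, purely for illustration, that each recording and each image has already been turned into a vector in a shared space and that cosine similarity decides which images "match". The function names, the vector size, and the toy random data are all hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_images(audio_vec, image_vecs, k=10):
    """Return the indices of the k images most similar to one recording."""
    ranked = sorted(
        range(len(image_vecs)),
        key=lambda i: cosine(audio_vec, image_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(audio_vecs, image_vecs, k=10):
    """Fraction of recordings whose paired image (same index) is in the top k."""
    hits = sum(
        1 for i, audio in enumerate(audio_vecs)
        if i in top_k_images(audio, image_vecs, k)
    )
    return hits / len(audio_vecs)


# Toy stand-in data (an assumption, not real embeddings): 1,000 random
# "image" vectors, plus 20 "recordings" that are noisy copies of the
# first 20 images.
rng = random.Random(0)
image_vecs = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(1000)]
audio_vecs = [[x + rng.gauss(0, 0.1) for x in vec] for vec in image_vecs[:20]]

score = recall_at_k(audio_vecs, image_vecs, k=10)
print(f"recall@10 on toy data: {score:.2f}")
```

The score is simply the share of recordings for which the correct image appears among the 10 returned, which is one common way to summarize a "retrieve 10 images" test.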

Over time, the hope is that such approaches to speech recognition will become effective enough that laborious labeling of speech training data is no longer considered a necessity.

If all goes according to plan, that should be better for everyone — whether you’re an English speaker in the U.S. or a speaker of Xhosa in South Africa.

Luke Dormehl
I'm a UK-based tech writer covering Cool Tech at Digital Trends. I've also written for Fast Company, Wired, the Guardian…