Friday, June 18, 2021

Computer voice recognition is still learning to detect who's talking

A better understanding of how people tell different voices apart could help create better voice recognition software.

(Inside Science) – If your phone rings and you answer without looking at the caller ID, it's quite possible that before the person on the other end finishes saying hello, you already know it's your mother. Within seconds, you can also tell whether she is happy, sad, angry or anxious.

People can naturally recognize and identify other people by their voices. A new study, published in The Journal of the Acoustical Society of America, explored exactly how people are able to do that. The results may help researchers create more effective voice recognition software in the future.

The complexity of speech

"It's a crazy problem for our hearing system to figure out: how many sounds there are, what they are and where they are," said Tyler Perrachione, a neuroscientist and linguist at Boston University who did not participate in the study.

Nowadays, Facebook has little trouble identifying faces in photos, even when a person is shown from different angles or under different lighting. Voice recognition software is much more limited by comparison, said Perrachione, and this may be related to our lack of understanding of how people identify voices.

"We humans have different speaker models for different individuals," said Neeraj Sharma, a psychologist at Carnegie Mellon University in Pittsburgh and lead author of the recent study. "When you listen to a conversation, you switch between different models in your brain so you can understand each speaker better."

People develop speaker models in their brains as they are exposed to different voices, taking into account subtle differences in features such as rhythm and timbre. By naturally switching between speaker models and adapting them based on who is speaking, people learn to identify and understand different speakers.

"At the moment, voice recognition systems do not focus on the speaker aspect; they basically use the same speaker model to analyze everything," Sharma said. "For example, when you talk to Alexa, it uses the same speaker model to analyze my speech as it does your speech."

So let's say you have a pretty thick Alabama accent. Alexa may think you are saying "cane" when you are trying to say "can't."

"If we can understand how people use speaker-dependent models, then perhaps we can teach a machine system to do it," Sharma said.

Listen and say "when"

In the new study, Sharma and his colleagues designed an experiment in which a group of volunteers listened to audio of two voices speaking in turn and were asked to identify the exact moment one speaker took over from the other.

This allowed the researchers to explore the relationship between certain audio features and the volunteers' response times and false alarm rates. They then began to decipher which cues people listen for to detect a change of speaker.

"We currently do not have many different experiments that allow us to study speaker identification or voice recognition, so this experimental design is actually smart," Perrachione said.

When the researchers ran the same test on several different types of voice recognition software, including a commercially available program developed by IBM, they found that, as expected, the volunteers did better than all of the programs tested.

Sharma said they plan to look at the brain activity of people listening to different voices using electroencephalography, or EEG, a noninvasive method of monitoring brain activity. "This can help us further analyze the way the brain reacts when there is a change in the speaker," he said.
