Conversational Guide to Automatic Speech Recognition

Welcome to our conversational guide to automatic speech recognition. You're about to learn all the essential information about this fantastic technology. Feel free to read the contents of this guide in order or jump straight to the section that sparks your interest. Here's a list of topics covered in this guide:

Genesis of Automatic Speech Recognition
Definition of Automatic Speech Recognition
The Way Automatic Speech Recognition Works
Effectiveness of Automatic Speech Recognition
Challenges of Automatic Speech Recognition
Applications of Automatic Speech Recognition
Conclusion

What’s the Genesis of Speech Recognition?

Here's a list of several essential facts that mark the genesis of speech recognition:

Name of the First Machine that Recognised Human Speech
Creation Date of the First Machine that Recognised Human Speech
Location Where the First Machine that Recognised Human Speech Was Created
Inventors of the First Machine that Recognised Human Speech
The Way the First Machine that Recognised Human Speech Worked
Effectiveness of the First Machine that Recognised Human Speech
Impact of the First Machine that Recognised Human Speech

What’s the Name of the First Machine that Recognised Human Speech?

The first machine that recognised human speech was called “Audrey”, the Automatic Digit Recognition machine.

When Was the First Machine that Recognised Voice Created?

The first machine that recognised voice was created in 1952.

Where Was the First Machine that Recognised Human Speech Created?

The first machine that recognised human speech was created in Bell Labs.

Who Invented the First Machine that Recognised Voice?

Three researchers invented the first machine that recognised human speech: Stephen Balashek, R. Biddulph, and K. H. Davis.

How Did the First Machine that Recognised Human Speech Work?

The first machine that recognised human speech was a single-speaker system. It worked by locating the formants of each utterance, that is, the local maxima in the power spectrum produced by the acoustic resonances of the human vocal tract. This allowed it to recognise the fundamental units of speech, phonemes.
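
The idea of locating local maxima in a power spectrum can be illustrated with a short digital sketch. This isn't a reconstruction of the original machine; it's a minimal modern example under stated assumptions. The sample rate, the two synthetic tones standing in for vocal tract resonances, and the function names are all made up for illustration.

```python
import cmath
import math


def power_spectrum(samples):
    """Return the power at each DFT bin from 0 up to the Nyquist bin."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def local_maxima(values, floor=0.0):
    """Return indices whose value exceeds both neighbours and the floor."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] > values[i - 1]
        and values[i] > values[i + 1]
        and values[i] > floor
    ]


# Synthetic "utterance": two sinusoids standing in for two resonances.
SAMPLE_RATE = 8000  # Hz (assumed for the example)
N = 200             # number of samples, giving 40 Hz per bin
signal = [
    math.sin(2 * math.pi * 720 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 1200 * t / SAMPLE_RATE)
    for t in range(N)
]

spectrum = power_spectrum(signal)
# Ignore tiny numerical ripples by requiring 10% of the strongest peak.
peaks = local_maxima(spectrum, floor=0.1 * max(spectrum))
peak_frequencies = [i * SAMPLE_RATE / N for i in peaks]
print(peak_frequencies)  # [720.0, 1200.0]
```

The two reported frequencies are the peaks of the spectrum, which is the kind of information a formant-based recogniser compares against stored patterns.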

How Effective Was the First Machine that Recognised Voice?

The first machine that recognised voice was able to recognise single digits separated by pauses and spoken by designated people (meaning that it didn’t recognise input from everyone). Needless to say, it wasn’t much use outside the lab.

What Was the Impact of the First Machine that Recognised Human Speech?

The first machine that recognised human speech produced results promising enough to sustain further research, which expanded the number of words that could be understood and paved the way for continuous speech recognition.

What’s Automatic Speech Recognition (ASR)?

Automatic speech recognition (ASR), also known as computer speech recognition or speech-to-text (STT), is an interdisciplinary study of spoken language spanning computer science and linguistics, focused on recognising spoken words using technology and converting them into text. The reverse process is called speech synthesis.

How Does Automatic Speech Recognition Work?

Automatic speech recognition breaks down spoken input into bits it can interpret, converts them into a digital format, and analyses each piece to hypothesise what the user is saying based on the spelling style and speech patterns present in training data.

What’s a Spelling Style?

Spelling style describes a strategy used by an automatic speech recognition engine to recognise and disambiguate words and letters that might sound similar but are spelt differently and have a different meaning. Spelling styles define how ASR interprets spoken input from the user in one of the following ways:

Default Spelling Style
Spell By Letter Spelling Style
Spell by Word Spelling Style

What’s a Default Spelling Style?

Default spelling style enables ASR to capture and recognise complete words in the spoken input and works well in most situations when users use common terms and an expected pronunciation for their locale.

What’s a Spell by Letter Spelling Style?

Spell by letter spelling style helps ASR recognise user-specific details and other uncommon words by changing how speech recognition interprets spoken input from whole phrases to single letters. Each letter is identified separately and resolved into a complete word after the user spells all of them.
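
As a rough illustration of the last step, resolving separately recognised letters into a word, here's a minimal sketch. The function name and the validation rule (exactly one letter or digit per token) are assumptions made for this example, not the behaviour of any particular ASR engine.

```python
def resolve_spelled_word(tokens):
    """Join separately recognised letters (or digits) into one word."""
    cleaned = [token.strip().lower() for token in tokens if token.strip()]
    for token in cleaned:
        if len(token) != 1 or not token.isalnum():
            raise ValueError(f"expected a single letter or digit, got {token!r}")
    return "".join(cleaned)


print(resolve_spelled_word(["K", "o", "w", "a", "l"]))  # kowal
```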

What’s a Spell by Word Spelling Style?

Spell by word spelling style helps ASR to disambiguate between similar-sounding letters by letting users follow each letter with an example word that starts with that letter, using one of the following formats: "as in", "for", and "like". For example:

"a" as in alpha
"a" for alpha
"a" like alpha
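
The three formats above could be parsed along the following lines. This is a minimal sketch: the regular expression, the function name, and the choice to trust the example word over the spoken letter (on the assumption that the word is the part that disambiguates) are all illustrative.

```python
import re

# Matches phrases such as: "a" as in alpha / a for alpha / a like alpha
SPELL_BY_WORD = re.compile(
    r'^\s*"?(?P<letter>[a-z])"?\s+(?:as in|for|like)\s+(?P<word>[a-z]+)\s*$',
    re.IGNORECASE,
)


def letter_from_phrase(phrase):
    """Return the intended letter, or None if the phrase doesn't fit a format.

    Assumption: when the heard letter and the example word disagree,
    the first letter of the example word wins.
    """
    match = SPELL_BY_WORD.match(phrase)
    if match is None:
        return None
    return match.group("word")[0].lower()


print(letter_from_phrase('"a" as in alpha'))  # a
print(letter_from_phrase("d for Bravo"))      # b
```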

How Effective Is Automatic Speech Recognition?

Although some statistics show that modern ASR has greater than 90% accuracy, it’s worth keeping in mind that this is under ideal conditions, including a person with a clear voice in a quiet room with a good microphone. However, real-life scenarios pose several challenges that can decrease speech recognition accuracy.

What Are the Challenges of Automatic Speech Recognition?

While the effectiveness of automatic speech recognition seems compelling, it's not free from the limitations of an audio-based medium, which everyday situations tend to amplify. Some examples include:

Background Noise
Side Speech
Uncommon Words

Why Is Background Noise Challenging for Automatic Speech Recognition?

Noise is one of the most difficult challenges for automatic speech recognition because it lowers the signal-to-noise ratio. When noise exceeds a certain level, it can effectively block automated speech recognition altogether.
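
The signal-to-noise ratio is commonly expressed in decibels as ten times the base-10 logarithm of speech power divided by noise power. The sketch below computes it for two lists of samples. The 10 dB cut-off and the function names are hypothetical values chosen for the example, not limits taken from any specific recogniser.

```python
import math


def mean_power(samples):
    """Average of the squared sample values."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


MIN_SNR_DB = 10.0  # hypothetical limit below which recognition is not attempted


def is_recognisable(speech, noise, min_snr_db=MIN_SNR_DB):
    """True when the speech is loud enough relative to the noise."""
    return snr_db(speech, noise) >= min_snr_db


speech = [1.0, -1.0] * 4
quiet_noise = [0.1, -0.1] * 4
print(round(snr_db(speech, quiet_noise), 6))  # 20.0
print(is_recognisable(speech, speech))        # False (0 dB)
```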

How to Improve Speech Recognition in a Noisy Environment?

There's no better solution to improve speech recognition accuracy during a voice interaction than instructing the user to find a quieter place to continue the conversation.

Why Is Side Speech Challenging for Automatic Speech Recognition?

When the user talks to a friend or coworker as an aside while the automatic speech recognition system is listening, the system has no way of knowing that this spoken input is irrelevant and shouldn’t be processed.

How to Mitigate Problems Caused by Side Speech to Improve Speech Recognition?

One of the best ways to mitigate the issues with side speech is to instruct the user to mute the microphone when their words aren’t meant to be recognised.

Why Are Uncommon Words Challenging for Automatic Speech Recognition?

Names, spellings, domain-specific vocabulary, and alphanumeric strings are among the most challenging phrases to recognise, even for human beings. ASR gets confused by words that are difficult to pronounce or uncommon in a given locale because the recognition model was trained on too little data containing them.

How to Improve Recognition Accuracy of Uncommon Words?

While recognition accuracy for uncommon or difficult-to-pronounce words is one of the biggest challenges of automatic speech recognition, the following methods can help to achieve better results in these situations:

Using Spelling Styles to Improve Speech Recognition Accuracy
Using Custom Vocabularies to Improve Speech Recognition Accuracy

How to Use Spelling Styles to Improve Speech Recognition Accuracy for Uncommon Words?

In most cases, it’s best to start with an attempt to recognise complete phrases using the default spelling style. For incorrectly captured inputs, it’s beneficial to re-prompt the user and change the spelling style to either spell-by-letter or spell-by-word.
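
The re-prompting strategy can be sketched as a simple loop over spelling styles. Everything here is hypothetical: `recognise` stands for whatever call prompts the user and returns the engine's result for a given style, `is_valid` stands for your own check of the captured value, and the style names are placeholders.

```python
SPELLING_STYLES = ["default", "spell_by_letter", "spell_by_word"]


def capture_with_fallback(recognise, is_valid, styles=SPELLING_STYLES):
    """Try each spelling style in order until a valid value is captured.

    Returns (value, style) on success or (None, None) if every style fails.
    """
    for style in styles:
        value = recognise(style)  # one prompt (or re-prompt) per style
        if value is not None and is_valid(value):
            return value, style
    return None, None


# Stub recogniser: the default style mishears the name, spelling fixes it.
answers = {"default": "Coval", "spell_by_letter": "Kowal"}
result = capture_with_fallback(
    recognise=lambda style: answers.get(style),
    is_valid=lambda value: value == "Kowal",
)
print(result)  # ('Kowal', 'spell_by_letter')
```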

How to Use Custom Vocabularies to Improve Speech Recognition Accuracy for Uncommon Words?

Creating a custom vocabulary can help to improve speech recognition accuracy in audio conversations by helping the engine favour relevant terms over alternatives that sound similar but aren’t relevant. It’s especially useful for uncommon, domain-specific terminology.
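
One way to picture the effect of a custom vocabulary is as a small score boost for candidate transcriptions that contain vocabulary terms. The sketch below assumes the engine exposes a list of competing hypotheses with scores. That assumption, the boost value, and the function name are illustrative only, and real engines may apply vocabularies quite differently.

```python
def pick_hypothesis(hypotheses, custom_vocabulary, boost=0.2):
    """Pick the best transcription after boosting custom vocabulary terms.

    hypotheses: list of (text, score) pairs, higher score is better.
    """
    vocabulary = {term.lower() for term in custom_vocabulary}

    def boosted_score(item):
        text, score = item
        hits = sum(1 for word in text.lower().split() if word in vocabulary)
        return score + boost * hits

    return max(hypotheses, key=boosted_score)[0]


candidates = [("i need a stint", 0.55), ("i need a stent", 0.50)]
print(pick_hypothesis(candidates, []))         # i need a stint
print(pick_hypothesis(candidates, ["stent"]))  # i need a stent
```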

What Are the Applications of Automatic Speech Recognition?

Automatic speech recognition is an essential component of automated voice interfaces like voice bots and multimodal bots. It not only converts speech to text, which enables further processing and understanding of the meaning behind the words, but also enables capabilities like:

Voice Activity Detection
End-of-Speech Detection
Speaker Identification

What’s Voice Activity Detection (VAD)?

Voice activity detection (also known as speech activity detection or speech detection) is a technology that enables detecting moments when users speak.

How Does Voice Activity Detection Work?

Voice activity detection works by:

Reducing the background noise of the audio input
Measuring and quantifying the input characteristics
Comparing measurements against predefined thresholds
Classifying audio input as speech or non-speech
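
The four steps above can be sketched with a simple energy-based detector. This is a minimal example under stated assumptions: noise reduction is approximated by subtracting an estimated noise energy, the measured characteristic is mean frame energy, and the frame size, noise floor, and threshold are made-up values.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def detect_voice_activity(samples, frame_size, noise_floor, threshold):
    """Return one boolean per full frame: True where speech is detected."""
    decisions = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        # Steps 1 and 2: measure energy and remove the estimated noise energy.
        energy = max(frame_energy(frame) - noise_floor, 0.0)
        # Steps 3 and 4: compare against the threshold and classify.
        decisions.append(energy > threshold)
    return decisions


audio = [0.01] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.01] * 4
print(detect_voice_activity(audio, frame_size=4, noise_floor=0.0001, threshold=0.01))
# [False, True, False]
```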

What’s End-of-Speech Detection?

End-of-speech detection is a technology that analyses input audio streams to find the best moment to mark the end of each conversation turn.

How Does End-of-Speech Detection Work?

End-of-speech detection works by measuring the length of the pauses in what the user is saying and comparing them with predefined limits to decide when the user has finished speaking.
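
Here's a minimal sketch of that pause-measuring logic. It assumes the audio has already been split into frames labelled as speech or non-speech. The frame length, the pause limit, and the function name are illustrative.

```python
def find_end_of_speech(frame_is_speech, frame_ms, max_pause_ms):
    """Return the index of the frame where the turn is considered finished.

    Returns None if no pause after speech reaches max_pause_ms.
    """
    heard_speech = False
    pause_ms = 0
    for index, is_speech in enumerate(frame_is_speech):
        if is_speech:
            heard_speech = True
            pause_ms = 0  # any speech resets the pause timer
        elif heard_speech:
            pause_ms += frame_ms
            if pause_ms >= max_pause_ms:
                return index
    return None


frames = [False, True, True, False, True, False, False, False]
print(find_end_of_speech(frames, frame_ms=100, max_pause_ms=300))  # 7
print(find_end_of_speech(frames, frame_ms=100, max_pause_ms=400))  # None
```

Silence before the user starts speaking is ignored, and the short pause in the middle of the utterance doesn't end the turn because it stays below the limit.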

What’s Speaker Identification?

Speaker identification, also known as speaker recognition, is a technology that enables identifying different speakers in an audio stream.

How Does Speaker Identification Work?

Speaker identification labels speaker utterances by analysing the characteristics of voices present in an input audio stream.
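
As a rough illustration, labelling can be thought of as matching each utterance's voice characteristics against known speaker profiles. The sketch below assumes each utterance has already been reduced to a numeric feature vector and that a profile vector exists for every speaker. How those vectors are produced is outside the scope of this example, and the names and numbers are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def label_utterances(utterance_features, speaker_profiles):
    """Label each utterance with the name of the most similar profile."""
    labels = []
    for features in utterance_features:
        best = max(
            speaker_profiles,
            key=lambda name: cosine_similarity(features, speaker_profiles[name]),
        )
        labels.append(best)
    return labels


profiles = {"agent": [1.0, 0.0], "caller": [0.0, 1.0]}
utterances = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
print(label_utterances(utterances, profiles))  # ['agent', 'caller', 'agent']
```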

Conclusion

Speech recognition has come a long way from the first machine that recognised voice to the modern version of automatic speech recognition. Besides effectively recognising spoken language and converting it to text, it has various other applications, including voice activity detection, end-of-speech detection, and speaker identification. These capabilities are fundamental for creating conversational interfaces that rely on spoken language, such as voice-only and multimodal bots.