
Make a Voice-Controlled Audio Player with the Web Speech API

William Shakespeare
2025-02-18


Key takeaways

  • The Web Speech API is a JavaScript API that lets web developers integrate speech recognition and synthesis into their web pages, enhancing the user experience, especially for people with disabilities or users who need to handle several tasks at once.
  • The Speech Recognition API currently requires an internet connection and the user's permission to access the microphone. Libraries such as Annyang can help manage the complexity and keep your code forward compatible.
  • You can combine the Speech Synthesis API and the Speech Recognition API to build a voice-controlled audio player that lets the user move between songs and request specific songs with voice commands.
  • The audio player consists of settings data, UI methods, Speech API methods, and audio operation methods. The code that recognizes and processes user input works only in WebKit browsers.
  • The Web Speech API has potential in many other areas, such as browsing email, navigating websites, or searching the web by voice. As implementations stabilize and new features are added, usage of the API is expected to grow.

This article was peer reviewed by Edwin Reynoso and Mark Brown. Thanks to all of SitePoint's peer reviewers for making SitePoint content the best it can be!

The Web Speech API is a JavaScript API that enables web developers to integrate speech recognition and synthesis into their web pages.

There are many reasons for doing this: for example, to enhance the experience of people with disabilities (particularly users with visual impairments or limited hand mobility), or to let users interact with a web application while performing another task, such as driving.

If you have never heard of the Web Speech API, or you would like a quick primer, it might be a good idea to read Aurelio De Rosa's articles introducing the Web Speech API, the Speech Synthesis API, and the talking-forms idea.

Browser support

Browser vendors have only recently started implementing both the Speech Recognition API and the Speech Synthesis API. As you can see, support for these APIs is far from perfect, so if you are following along with this tutorial, use an appropriate browser.

In addition, the Speech Recognition API currently requires an internet connection, as the speech gets passed over the wire and the results are returned to the browser. If the connection uses HTTP, the user has to grant the site access to their microphone on every request. If the connection uses HTTPS, this only needs to be done once.

Speech recognition libraries

A library can help us manage the complexity and ensure we stay forward compatible. For example, when another browser starts supporting the Speech Recognition API, we won't have to worry about adding vendor prefixes.

Annyang is one such library, and it is very easy to work with. Let's find out more.

To initialize Annyang, we add its script to our website, for example from a CDN (the exact version and URL may differ):

<code class="language-javascript"><script src="https://cdnjs.cloudflare.com/ajax/libs/annyang/2.6.1/annyang.min.js"></script></code>

We can check if the API is supported like this:

<code class="language-javascript">if (annyang) { /* logic */ }</code>

and add commands using an object with the command names as keys and the callbacks as values:

<code class="language-javascript">var commands = {
  'show divs': function() {
    $('div').show();
  },
  'show forms': function() {
    $("form").show();
  }
};</code>

Finally, we just add our commands and start the speech recognition like this:

<code class="language-javascript">annyang.addCommands(commands);
annyang.start();</code>

Voice-controlled audio player

In this article, we will build a voice-controlled audio player. We will use both the Speech Synthesis API (to tell the user which song is beginning, or that a command was not recognized) and the Speech Recognition API (to convert voice commands into strings that trigger different application logic).

The great thing about an audio player that uses the Web Speech API is that users can browse to another page in their browser, or minimize the browser and do something else, while still being able to switch between songs. If we have many songs in the playlist, we can even request a specific song without searching for it manually (if we know its name or singer, of course).

We will not rely on a third-party library for the speech recognition, as we want to show how to use the API without adding extra dependencies to our projects. The voice-controlled audio player only supports browsers that implement the interimResults attribute; the latest version of Chrome should be a safe bet.

As always, you can find the full code on GitHub, as well as a demo on CodePen.

Getting started: the playlist

Let's start with a static playlist. It consists of an object holding an array of different songs. Each song is an object containing the path to the file, the singer's name, and the name of the song:

<code class="language-javascript">var data = {
  "songs": [
    {
      "fileName": "https://www.ruse-problem.org/songs/RunningWaters.mp3",
      "singer" : "Jason Shaw",
      "songName" : "Running Waters"
    },
    ...</code>

We should be able to add new objects to the songs array and automatically include new songs into our audio player.

Audio player

Now let's look at the player itself. This will be an object containing the following:

  • Some setting data
  • Methods related to UI (such as filling song lists)
  • Methods related to the Speech API (such as recognizing and processing commands)
  • Methods related to audio operation (e.g. play, pause, stop, previous, next)

Settings data

This is relatively simple.

<code class="language-javascript">var audioPlayer = {
  audioData: {
    currentSong: -1,
    songs: []
  },</code>
The currentSong attribute holds the index of the song the user is currently on. This is useful, for example, when we have to play the previous/next song, or stop/pause the current one.

The songs array contains all the songs the user has listened to. This means that the next time the user listens to the same song, we can load it from the array instead of downloading it again.

You can view the full code here.

UI method

The UI will consist of a list of available commands, a list of available tracks, and a context box to inform the user of both the current action and the previous command. I won't go into the UI methods in detail; rather, here's a brief overview. You can find the code for these methods here.

load

This will iterate over the playlist we declared earlier and append the song's name, along with the artist's name, to the list of available tracks.

changeCurrentSongEffect

This indicates which song is currently playing (by marking it in green and adding a pair of headphones next to it), and which songs have been played.

playSong

This lets the user know that a song is playing, or has ended, both through the changeStatusCode method (which adds this information to the context box) and by announcing the change through the Speech API.

changeStatusCode

As mentioned above, this updates the status message in the context box (for example, indicating that a new song is being played) and uses the speak method to notify the user of this change.

changeLastCommand

A small helper function to update the last command box.

toggleSpinner

A small helper function to hide or display the spinner icon (indicating that the user's voice command is currently being processed).

Player method

The player will be responsible for what you might expect, namely: starting, stopping, and pausing playback, and moving back and forth between tracks. Again, I won't go through these methods in detail, but would rather point you toward our code base on GitHub.

play

This checks whether the user has listened to this song before. If not, it starts the song; otherwise it simply calls the playSong method we discussed earlier on the currently cached song, which sits in audioData.songs at the currentSong index.
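As a rough sketch of that caching idea (the function and parameter names here are hypothetical, and a createAudio factory stands in for the browser's `new Audio(fileName)` constructor so the logic is visible on its own):

```javascript
// Sketch of the play caching logic. createAudio is a hypothetical stand-in
// for the browser's `new Audio(fileName)` constructor.
function getOrCreateSong(audioData, playlist, createAudio) {
  var index = audioData.currentSong;
  if (!audioData.songs[index]) {
    // First listen: create the Audio object and cache it for next time
    audioData.songs[index] = createAudio(playlist.songs[index].fileName);
  }
  // Cached or fresh, this is what playSong receives
  return audioData.songs[index];
}
```

Calling this twice for the same index only creates the Audio object once; the second call returns the cached copy.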

pauseSong

This pauses a song, or stops it completely (returning the playback time to the beginning of the song), depending on what is passed as the second parameter. It also updates the status code to notify the user that the song has been stopped or paused.
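A minimal sketch of that pause/stop distinction, assuming song behaves like an HTMLAudioElement (the function and parameter names here are hypothetical, not the article's actual code):

```javascript
// Sketch: pause the song; with fullStop set, also rewind to the start.
function pauseSong(song, fullStop) {
  song.pause();
  if (fullStop) {
    song.currentTime = 0; // a full stop returns playback to the beginning
  }
}
```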

stop

This pauses or stops the song based on its first and only parameter.

prev

This checks whether the previous song is cached. If so, it pauses the current song, decrements currentSong, and plays the song at the new index. If that song is not in the array, it does the same, but first loads it from the file name/path corresponding to the decremented currentSong index.

next

If the user has listened to a song before, this method will try to pause it. If the next song exists in our data object (i.e. our playlist), it loads and plays it. If there is no next song, it will just change the status code and inform the user that they have reached the last song.
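The boundary check behind next can be sketched like this (a hypothetical helper, not the article's actual code; -1 signals that the caller should update the status code instead of playing):

```javascript
// Sketch of the next-song decision: advance currentSong if the playlist
// has another entry, otherwise signal that the last song was reached.
function nextSongIndex(currentSong, playlist) {
  if (currentSong < playlist.songs.length - 1) {
    return currentSong + 1;
  }
  return -1; // no next song: caller informs the user instead
}
```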

searchSpecificSong

This takes a keyword as a parameter, performs a linear search over the song names and artists, and plays the first match.
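That linear search might look something like the following sketch (the helper name is hypothetical; it returns the index of the first match so the caller can play it, or -1 if nothing matches):

```javascript
// Sketch of the linear search over song names and singers.
function findSong(songs, keyword) {
  keyword = keyword.toLowerCase();
  for (var i = 0; i < songs.length; i++) {
    if (songs[i].songName.toLowerCase().indexOf(keyword) !== -1 ||
        songs[i].singer.toLowerCase().indexOf(keyword) !== -1) {
      return i; // first match wins
    }
  }
  return -1; // no song matched the keyword
}
```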

Speech API methods

The Speech API is surprisingly easy to implement. In fact, it only takes two lines of code to get a web application talking to its users:

<code class="language-javascript">var utterance = new SpeechSynthesisUtterance('Hello');
window.speechSynthesis.speak(utterance);</code>

What we do here is create an utterance object containing the text we want spoken. The speechSynthesis interface (available on the window object) is responsible for handling this utterance object and controlling the playback of the resulting speech.

Go ahead and try it in your browser. It's that simple!

speak

We can see it in action in our speak method, which reads aloud the message passed as a parameter:

<code class="language-javascript">speak: function(text, scope) {
  var message = new SpeechSynthesisUtterance(text);
  window.speechSynthesis.speak(message);
  if (scope) {
    message.onend = function() {
      scope.play();
    };
  }
}</code>

If a second argument, scope, exists, we call play on it (it will be an Audio object) after the message has finished playing.

processCommands

This method is not that exciting. It takes a command as a parameter and calls the appropriate method in response. It uses a regular expression to check whether the user wants to play a specific song; otherwise, it enters a switch statement to test for the different commands. If none matches the received command, it informs the user that the command was not understood.

You can find its code here.
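The regex-then-switch dispatch can be sketched like this (a hypothetical standalone helper returning an action descriptor rather than calling the player methods directly, so the routing logic is visible on its own):

```javascript
// Sketch of the dispatch logic: a regular expression catches
// "play <something>" requests, a switch handles the fixed commands.
function routeCommand(command) {
  var specific = command.match(/^play (.+)$/i);
  if (specific) {
    return { action: 'searchSpecificSong', keyword: specific[1] };
  }
  switch (command.toLowerCase()) {
    case 'play':     return { action: 'play' };
    case 'pause':    return { action: 'pause' };
    case 'stop':     return { action: 'stop' };
    case 'previous': return { action: 'prev' };
    case 'next':     return { action: 'next' };
    default:         return { action: 'unrecognized' };
  }
}
```

Note that a bare "play" falls through to the switch, because the regex requires a keyword after the space.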

Tying everything together

So far, we have a data object representing our playlist, as well as an audioPlayer object representing the player itself. Now we need to write some code to recognize and process user input. Please note that this will only work in WebKit browsers.

Getting users to talk to your app requires just as little code as before:

<code class="language-javascript">var recognizer = new webkitSpeechRecognition();
recognizer.onresult = function(event) {
  console.log(event);
};
recognizer.start();</code>

This will prompt the user to allow the page to access their microphone. If access is granted, the user can start talking, and when they pause, the onresult event will fire, making the results of the speech capture available as a JavaScript object.

Reference: HTML5 Speech Recognition API

We can implement it in our application as follows:

<code class="language-javascript">if (window['webkitSpeechRecognition']) {
  var speechRecognizer = new webkitSpeechRecognition();

  // Recognition settings
  speechRecognizer.continuous = true;
  speechRecognizer.interimResults = true;
  speechRecognizer.lang = 'en-US';

  // The onstart, onresult and onend handlers go here

  speechRecognizer.start();
} else {
  alert('Your browser does not support the Web Speech API');
}</code>

As you can see, we test for the presence of webkitSpeechRecognition on the window object. If it's there, we're good to go; otherwise, we inform the user that the browser doesn't support it. If all is well, we then set a few options. Of these, lang is an interesting one, as it can improve the recognition results based on where your users come from.

Then we declare handlers for the onstart, onresult, and onend events before starting the recognizer.

Processing results

When the speech recognizer produces a result, we want to do a few things, at least in the context of the current speech recognition implementation and our needs. Every time there is a result, we save it in an array and set a timeout of three seconds, giving the browser time to collect any further results. When the three seconds are up, we use the collected results, looping over them in reverse order (newer results have a better chance of being accurate) and checking whether any recognized transcript contains one of our available commands. If one does, we execute the command and restart the speech recognition. We do this because waiting for a final result can take up to a minute, which would make the audio player seem unresponsive and pointless, since it would be faster to just click a button.

<code class="language-javascript">// Sketch: collect results, then process them after a three-second pause
var results = [];
var timeoutId;

speechRecognizer.onresult = function(event) {
  audioPlayer.toggleSpinner(true);
  results.push(event.results);

  if (timeoutId) {
    clearTimeout(timeoutId);
  }

  // Wait three seconds for any further results before processing
  timeoutId = setTimeout(function() {
    // Newer results have a better chance of being accurate, so loop over
    // them in reverse order and hand each transcript to processCommands
    for (var i = results.length - 1; i >= 0; i--) {
      var result = results[i][results[i].length - 1];
      audioPlayer.processCommands(result[0].transcript.trim());
    }
    results = [];
    audioPlayer.toggleSpinner(false);
  }, 3000);
};</code>

Because we aren't using a library, we have to write more code to set up our speech recognizer, looping over each result and checking whether its transcript matches a given keyword.

Finally, we restart it immediately at the end of speech recognition:

<code class="language-javascript">speechRecognizer.onend = function() {
  speechRecognizer.start();
};</code>

You can view the full code for this section here.

That's it! We now have a fully functional, voice-controlled audio player. I strongly recommend downloading the code from GitHub and playing around with it, or checking out the CodePen demo. I have also made available a version that is served over HTTPS.

Conclusion

I hope this practical tutorial has served as a good introduction to what the Web Speech API makes possible. I think we will see its usage grow as implementations stabilize and new features are added. For instance, I can imagine a future YouTube that is completely voice-controlled, where we can watch different users' videos, play specific songs, and move between songs, all with voice commands.

The Web Speech API could also bring improvements, or open up new possibilities, in many other areas: browsing email, navigating websites, or searching the web, all by voice.

Are you using this API in your projects? I'd love to hear from you in the comments below.

Frequently Asked Questions (FAQ) about Voice-Controlled Audio Players Using the Web Speech API

How does the Web Speech API work in a voice-controlled audio player?

The Web Speech API is a powerful tool that allows developers to integrate speech recognition and synthesis into their web applications. In a voice-controlled audio player, the API works by converting spoken commands into text that the application can then interpret and execute. For example, if the user says "play", the API will convert it to text, and the application will understand that this is the command to start playing audio. This process involves sophisticated algorithms and machine learning techniques to accurately identify and interpret human speech.

What are the advantages of using voice-controlled audio players?

Voice-controlled audio players have several advantages. First, they provide a hands-free experience, which is especially useful when users are busy with other tasks. Second, they can enhance accessibility for users with limited mobility, who may find traditional controls difficult to use. Finally, they offer a novel and engaging user experience that can make your application stand out from the competition.

Can I use the Web Speech API in any web browser?

Most modern web browsers support at least part of the Web Speech API, including Google Chrome, Mozilla Firefox, and Microsoft Edge. However, it is always best to check specific browser compatibility before integrating the API into your application, as support may vary between versions and platforms.

How can I improve the accuracy of speech recognition in a voice-controlled audio player?

You can improve the accuracy of speech recognition by using a high-quality microphone, reducing background noise, and setting a recognition language that matches your users' voices and accents. Additionally, you can implement error handling in your application to deal with unrecognized commands and provide feedback to users.

Can I customize the voice commands in a voice-controlled audio player?

Yes, you can customize the voice commands in a voice-controlled audio player. You do this by defining your own set of commands in your application code, which are then matched against what the Web Speech API recognizes. This lets you tailor the user experience to your specific needs and preferences.

Does the Web Speech API support languages other than English?

Yes, the Web Speech API supports many languages. You can specify a language in the API settings, and it will recognize and interpret commands in that language. This makes it a versatile tool for developing applications for international audiences.

How secure is the Web Speech API?

The Web Speech API is designed with security in mind. It uses a secure HTTPS connection to transmit voice data and does not store any personal information. However, as with any web technology, it is important to follow security best practices, such as keeping software up to date and protecting your application from common web vulnerabilities.

Can I use the Web Speech API in my mobile application?

While the Web Speech API is primarily designed for web applications, it can also be used in mobile applications through web views. However, for native mobile applications, you may want to consider platform-specific speech recognition APIs, which may offer better performance and integration.

What are the limitations of the Web Speech API?

While the Web Speech API is a powerful tool, it does have some limitations. For example, speech recognition requires an internet connection, and its accuracy can be affected by factors such as background noise and the user's accent. Additionally, support for the API may vary between web browsers and platforms.

How do I get started with the Web Speech API?

To get started with the Web Speech API, you need a basic understanding of JavaScript and web development. You can then read through the API documentation, which provides detailed information about its features and how to use them. There are also many online tutorials and examples to help you integrate the API into your own applications.
