
Talking Web Pages and the Speech Synthesis API


Core points

  • The Speech Synthesis API allows a website to provide information to users by reading text aloud, which can greatly help visually impaired users and multitasking users.
  • The Speech Synthesis API provides a variety of methods and properties to customize the speech output, such as the language, the speaking rate, and the pitch. It also includes methods to start, pause, resume, and stop the synthesis process (see the minimal sketch after this list).
  • At the time of writing, the Speech Synthesis API is fully supported only by Chrome 33, with partial support in Safari on iOS 7. The API needs wider browser support before it can be relied on in production websites.
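As a quick preview of what the rest of the article covers, here is a minimal sketch of speaking a sentence with a custom language, rate, and pitch. It assumes a browser that implements window.speechSynthesis and SpeechSynthesisUtterance; the text and values are arbitrary examples.

```javascript
// Minimal sketch: speak a sentence, assuming the browser supports the API.
if ('speechSynthesis' in window) {
  var utterance = new SpeechSynthesisUtterance('Hello from the Speech Synthesis API');
  utterance.lang = 'en-GB';  // language of the voice
  utterance.rate = 1.2;      // speaking rate (1 is the default)
  utterance.pitch = 1;       // pitch (0–2, 1 is the default)

  window.speechSynthesis.speak(utterance);

  // The synthesis process can be paused and resumed at any time:
  // window.speechSynthesis.pause();
  // window.speechSynthesis.resume();
}
```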

A few weeks ago, I briefly discussed NLP and its related technologies. When dealing with natural language, there are two distinct yet complementary aspects to consider: automatic speech recognition (ASR) and text-to-speech (TTS). In the article introducing the Web Speech API, I discussed the Web Speech API, an API that provides speech input and text-to-speech output features in a web browser. You may have noticed that I only covered how to implement speech recognition in a website, not speech synthesis. In this article, we'll fill that gap and describe the Speech Synthesis API. Speech recognition gives users, especially those with disabilities, the chance to provide information to a website. Recall the use cases I highlighted:

> In a website, users could navigate pages or populate form fields using their voice. Users could also interact with a page while driving, without taking their eyes off the road. These are not trivial use cases.

Therefore, we can think of speech recognition as a channel from the user to the website. Speech synthesis, on the other hand, gives websites the ability to provide information to users by reading text aloud. This is especially useful for blind people and, more generally, for those with visual impairments.

There are as many use cases for speech synthesis as there are for speech recognition. Think of the systems implemented in some new cars that read your texts or emails so you don't have to take your eyes off the road. Visually impaired people who use computers are familiar with software like JAWS, which reads aloud whatever is on the desktop, allowing them to perform their tasks. These applications are great, but they are expensive. Thanks to the Speech Synthesis API, we can help the people who use our websites, regardless of whether they have disabilities or not.

As an example, imagine you're writing a post on your blog (as I'm doing right now), and to improve its readability you split it into paragraphs. Isn't this a good chance to use the Speech Synthesis API? Indeed, we could program our website so that, once a user hovers over (or focuses on) a piece of text, a speaker icon appears on the screen. If the user clicks the icon, we call a function that synthesizes the text of that paragraph. This is a non-trivial improvement. Even better, it introduces very little overhead for us as developers, and no overhead at all for our users. A basic implementation of this concept is shown in the sketch below.
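The following is a minimal sketch of that idea. It assumes a browser that supports the API; the button label and the read-aloud-button class name are purely illustrative, and in a real page you would only reveal the button on hover or focus, as described above.

```javascript
// Minimal sketch: add a "read aloud" button to every paragraph (assumes API support).
var paragraphs = document.querySelectorAll('p');

Array.prototype.forEach.call(paragraphs, function(paragraph) {
  var button = document.createElement('button');
  button.textContent = 'Read aloud';        // stands in for the speaker icon
  button.className = 'read-aloud-button';   // hypothetical class name, style as needed

  button.addEventListener('click', function() {
    // Synthesize the text of this specific paragraph.
    var utterance = new SpeechSynthesisUtterance(paragraph.textContent);
    window.speechSynthesis.speak(utterance);
  });

  paragraph.appendChild(button);
});
```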
Now that we have a better idea of the use cases for this API, let's look at its methods and properties.

Methods and attributes

The Speech Synthesis API defines an interface called SpeechSynthesis, whose structure is shown here. As in the previous article, this one won't cover all of the properties and methods described in the specification, because it would be too much for a single article. However, we'll explain enough elements to let you easily understand the ones not covered.

The SpeechSynthesisUtterance object

The first object we need to get to know is the SpeechSynthesisUtterance object. It represents the utterance (that is, the text) that the synthesizer will read aloud. This object is very flexible and can be customized in several ways. Besides the text, we can set the language, the speaking rate, and even the pitch used to pronounce the text. The following is a list of its properties:

- text – A string that specifies the utterance (text) to be synthesized.
- lang – A string representing the language of the speech synthesis (for example "en-GB" or "it-IT").
- voiceURI – A string that specifies the location of the speech synthesis service the web application wishes to use.
- volume – A number representing the volume of the speech. It ranges from 0 (minimum) to 1 (maximum) inclusive, and the default value is 1.
- rate – A number representing the speaking rate. It is relative to the default rate of the voice. The default value is 1. A value of 2 means the utterance will be spoken at twice the default speed. Values below 0.1 or above 10 are disallowed.
- pitch – A number representing the pitch of the voice. It ranges from 0 (minimum) to 2 (maximum) inclusive. The default value is 1.

To instantiate this object, we can either pass the text to synthesize as a constructor argument, or omit the text and set it later. The following code shows an example of the first case.

```javascript
// Create the utterance object
var utterance = new SpeechSynthesisUtterance('My name is Aurelio De Rosa');
```

The second case constructs a SpeechSynthesisUtterance and then assigns the parameters, as shown below.

```javascript
// Create the utterance object
var utterance = new SpeechSynthesisUtterance();
utterance.text = 'My name is Aurelio De Rosa';
utterance.lang = 'it-IT';
utterance.rate = 1.2;
```

Some of the event handlers exposed by this object are:

- onstart – Sets a callback that is fired when the synthesis starts.
- onpause – Sets a callback that is fired when the speech synthesis is paused.
- onresume – Sets a callback that is fired when the synthesis is resumed.
- onend – Sets a callback that is fired when the synthesis ends.

The SpeechSynthesisUtterance object allows us to set the text to be read aloud and to configure how it will be spoken. At the moment, though, we've only created an object representing the utterance; we still need to hand it to the synthesizer.

The SpeechSynthesis object

The SpeechSynthesis object doesn't need to be instantiated. It belongs to the window object and can be used directly. This object exposes several methods, such as:

- speak() – Accepts a SpeechSynthesisUtterance object as its only parameter. This method is used to synthesize an utterance.
- cancel() – Stops the synthesis process immediately.
- pause() – Pauses the synthesis process.
- resume() – Resumes the synthesis process.

Another interesting method is getVoices(). It doesn't accept any arguments and is used to retrieve the list (an array) of voices available in the specific browser. Each entry in the list provides information such as name (a mnemonic name that gives developers a hint about the voice, for example "Google US English"), lang (the language of the voice, for example it-IT), and voiceURI (the location of the speech synthesis service for this voice).

Important note: in Chrome and Safari, the voiceURI property is called voice. So, the demo we build in this article uses voice instead of voiceURI.

Browser compatibility

Unfortunately, at the time of writing, the only browsers that support the Speech Synthesis API are Chrome 33 (full support) and Safari on iOS 7 (partial support).

Demo

This section provides a simple demo of the Speech Synthesis API. The page lets you enter some text and have it synthesized. In addition, you can set the rate, the pitch, and the language you want to use. You can also stop, pause, or resume the synthesis of the text at any time using the respective buttons provided.

Before attaching the listeners to the buttons, we test for the implementation, because support for this API is still very limited. The test is very simple and consists of the following code:

```javascript
if (window.SpeechSynthesisUtterance === undefined) {
  // Not supported
} else {
  // Read my text
}
```

If the test fails, we show the message "API not supported". Once support has been verified, we dynamically load the available voices into the select box placed in the markup. Note that the getVoices() method has an issue in Chrome (#340160), so I created a workaround for it using setInterval(). We then attach a handler to each button so that it can trigger its specific action (play, stop, and so on).

A live demo of the code is available here. This demo, together with all the others I've built so far, is also available in my HTML5 API demo repository.

```html

charset="UTF-8"> name="viewport" content="width=device-width, initial-scale=1.0"/>

>Speech Synthesis API Demo>
  • { -webkit-box-sizing: border-box; -moz-box-sizing: border-box; box-sizing: border-box; }
<code>  body
  {
    max-width: 500px;
    margin: 2em auto;
    padding: 0 0.5em;
    font-size: 20px;
  }

  h1,
  .buttons-wrapper
  {
    text-align: center;
  }

  .hidden
  {
    display: none;
  }

  #text,
  #log
  {
    display: block;
    width: 100%;
    height: 5em;
    overflow-y: scroll;
    border: 1px solid #333333;
    line-height: 1.3em;
  }

  .field-wrapper
  {
    margin-top: 0.2em;
  }

  .button-demo
  {
    padding: 0.5em;
    display: inline-block;
    margin: 1em auto;
  }
  </style>
</head>
<body>
  <h1>Speech Synthesis API</h1>

  <h3>Play area</h3>
  <form action="" method="get">
    <label for="text">Text:</label>
    <textarea id="text"></textarea>
    <div class="field-wrapper">
      <label for="voice">Voice:</label>
      <select id="voice"></select>
    </div>
    <div class="field-wrapper">
      <label for="rate">Rate (0.1 - 10):</label>
      <input type="number" id="rate" min="0.1" max="10" value="1" step="any" />
    </div>
    <div class="field-wrapper">
      <label for="pitch">Pitch (0.1 - 2):</label>
      <input type="number" id="pitch" min="0.1" max="2" value="1" step="any" />
    </div>
    <div class="buttons-wrapper">
      <button id="button-speak-ss" class="button-demo">Speak</button>
      <button id="button-stop-ss" class="button-demo">Stop</button>
      <button id="button-pause-ss" class="button-demo">Pause</button>
      <button id="button-resume-ss" class="button-demo">Resume</button>
    </div>
  </form>

  <span id="ss-unsupported" class="hidden">API not supported</span>

  <h3>Log</h3>
  <div id="log"></div>
  <button id="clear-all" class="button-demo">Clear all</button>

  <script>
  // Test browser support
  if (window.SpeechSynthesisUtterance === undefined) {
    document.getElementById('ss-unsupported').classList.remove('hidden');
    ['button-speak-ss', 'button-stop-ss', 'button-pause-ss', 'button-resume-ss'].forEach(function(elementId) {
      document.getElementById(elementId).setAttribute('disabled', 'disabled');
    });
  } else {
    var text = document.getElementById('text');
    var voices = document.getElementById('voice');
    var rate = document.getElementById('rate');
    var pitch = document.getElementById('pitch');
    var log = document.getElementById('log');

    // Workaround for a Chrome issue (#340160 - https://code.google.com/p/chromium/issues/detail?id=340160)
    var watch = setInterval(function() {
      // Load all voices available
      var voicesAvailable = speechSynthesis.getVoices();

      if (voicesAvailable.length !== 0) {
        for(var i = 0; i < voicesAvailable.length; i++) {
          voices.innerHTML += '<option value="' + voicesAvailable[i].lang + '" ' +
                              'data-voice-uri="' + voicesAvailable[i].voiceURI + '">' +
                              voicesAvailable[i].name +
                              (voicesAvailable[i].default ? ' (default)' : '') +
                              '</option>';
        }

        clearInterval(watch);
      }
    }, 1);

    document.getElementById('button-speak-ss').addEventListener('click', function(event) {
      event.preventDefault();

      var selectedVoice = voices.options[voices.selectedIndex];

      // Create the utterance object setting the chosen parameters
      var utterance = new SpeechSynthesisUtterance();

      utterance.text = text.value;
      utterance.voice = selectedVoice.getAttribute('data-voice-uri');
      utterance.lang = selectedVoice.value;
      utterance.rate = rate.value;
      utterance.pitch = pitch.value;

      utterance.onstart = function() {
        log.innerHTML = 'Speaker started' + '<br>' + log.innerHTML;
      };

      utterance.onend = function() {
        log.innerHTML = 'Speaker finished' + '<br>' + log.innerHTML;
      };

      window.speechSynthesis.speak(utterance);
    });

    document.getElementById('button-stop-ss').addEventListener('click', function(event) {
      event.preventDefault();

      window.speechSynthesis.cancel();
      log.innerHTML = 'Speaker stopped' + '<br>' + log.innerHTML;
    });

    document.getElementById('button-pause-ss').addEventListener('click', function(event) {
      event.preventDefault();

      window.speechSynthesis.pause();
      log.innerHTML = 'Speaker paused' + '<br>' + log.innerHTML;
    });

    document.getElementById('button-resume-ss').addEventListener('click', function(event) {
      event.preventDefault();

      if (window.speechSynthesis.paused === true) {
        window.speechSynthesis.resume();
        log.innerHTML = 'Speaker resumed' + '<br>' + log.innerHTML;
      } else {
        log.innerHTML = 'Unable to resume. Speaker is not paused.' + '<br>' + log.innerHTML;
      }
    });

    document.getElementById('clear-all').addEventListener('click', function() {
      log.textContent = '';
    });
  }
  </script>
</body>
</html>
```

Conclusion

This article introduced the Speech Synthesis API, an API that synthesizes text and improves the overall experience for our websites' users, especially those with visual impairments. As we've seen, this API exposes several objects, methods, and properties, but it isn't difficult to use. Unfortunately, at the moment its browser support is very poor, with Chrome and Safari being the only browsers that support it. Hopefully more browsers will follow their lead, allowing you to realistically consider using it on your website. I've decided to. Don't forget to play with the demo, and please leave a comment if you liked the article. I'd really love to hear your opinion.

Frequently Asked Questions (FAQ) about Talking Web Pages and the Speech Synthesis API

What is the speech synthesis API and how does it work?

The Speech Synthesis API is a web-based interface that allows developers to integrate text-to-speech functionality into their applications. It works by converting written text into spoken words using a computer-generated voice: the text is broken down into speech components, which are then synthesized into speech. The API provides a range of voices and languages to choose from, allowing developers to customize the speech output to suit their needs.

How do I implement the speech synthesis API in a web application?

Implementing the Speech Synthesis API in your web application involves a few steps. First, create a new SpeechSynthesisUtterance instance and set its text property to the text you want read aloud. You can then set other properties, such as the voice, pitch, and rate, to customize the speech output. Finally, call the speak() method of the SpeechSynthesis interface to start the synthesis.
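A minimal sketch of these steps, assuming a browser that supports the API (the text and property values are arbitrary examples):

```javascript
// Minimal sketch of the steps described above (assumes browser support).
var utterance = new SpeechSynthesisUtterance();
utterance.text = 'Welcome to my website';  // the text to read aloud
utterance.lang = 'en-GB';                  // language
utterance.rate = 1;                        // speaking rate
utterance.pitch = 1;                       // pitch

// Hand the utterance to the synthesizer.
window.speechSynthesis.speak(utterance);
```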

Can I customize the voice and language of voice output?

Yes, the Speech Synthesis API provides a range of voices and languages to choose from. You can set them via the voice and lang properties of the SpeechSynthesisUtterance instance. The API also lets you adjust the pitch and rate of the voice to further customize the output.
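Below is a minimal sketch of selecting a voice by language. It assumes a browser that follows the specification, where getVoices() already returns the populated list and utterance.voice accepts a SpeechSynthesisVoice object (older Chrome and Safari builds used the voice/voiceURI naming quirk and had the getVoices() timing issue discussed earlier).

```javascript
// Minimal sketch: pick an Italian voice, if one is available (assumes spec behavior).
var utterance = new SpeechSynthesisUtterance('Il mio nome è Aurelio De Rosa');
utterance.lang = 'it-IT';

var voices = window.speechSynthesis.getVoices();
for (var i = 0; i < voices.length; i++) {
  if (voices[i].lang === 'it-IT') {
    utterance.voice = voices[i];  // a SpeechSynthesisVoice object, per the specification
    break;
  }
}

window.speechSynthesis.speak(utterance);
```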

What are the limitations of the speech synthesis API?

While the speech synthesis API is a powerful tool, it does have some limitations. For example, voice and language availability may vary by browser and operating system. Additionally, the quality of voice output may vary and may not always sound natural. Furthermore, this API does not provide control over the pronunciation of a particular word or phrase.

How do I handle errors when using the Speech Synthesis API?

The Speech Synthesis API provides an error event that you can listen for. This event is fired when an error occurs during speech synthesis. You can handle it by adding an event listener to the SpeechSynthesisUtterance instance (or by setting its onerror property) and providing a callback function that will be invoked when the event fires.
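A minimal sketch of listening for the error event, assuming a browser that supports the API (the logged message is just an example):

```javascript
// Minimal sketch: log synthesis errors (assumes browser support).
var utterance = new SpeechSynthesisUtterance('Some text to read aloud');

utterance.onerror = function(event) {
  // Browsers that implement SpeechSynthesisErrorEvent expose an error code on the event.
  console.error('Speech synthesis error:', event.error);
};

window.speechSynthesis.speak(utterance);
```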

Can I pause and resume speech output?

Yes, the Speech Synthesis API provides pause() and resume() methods that you can use to control the speech output. You call these methods on the SpeechSynthesis interface to pause and resume the speech.
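A minimal sketch, assuming an utterance is already being spoken and that the calls below are wired to UI controls such as the demo's Pause and Resume buttons:

```javascript
// Minimal sketch: pause and resume an ongoing utterance (assumes browser support).
window.speechSynthesis.speak(new SpeechSynthesisUtterance('A long piece of text to read aloud'));

// Later, for example from button handlers:
window.speechSynthesis.pause();      // pauses the current utterance

if (window.speechSynthesis.paused) {
  window.speechSynthesis.resume();   // continues from where it was paused
}
```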

Is the Speech Synthesis API supported in all browsers?

The Speech Synthesis API is now supported in most modern browsers, including Chrome, Firefox, Safari, and Edge. However, the availability of voices and languages may vary depending on the browser and the operating system.

Can I use the Speech Synthesis API in my mobile application?

Yes, the Speech Synthesis API can be used in mobile web applications. However, the availability of voices and languages may vary depending on the mobile operating system.

How do I test the Speech Synthesis API?

You can test the Speech Synthesis API by creating a simple web page that uses the API to convert written text into speech. You can then try different voices, languages, pitches, and rates to see how they affect the speech output.

Where can I find more information about the Speech Synthesis API?

You can find more information about the Speech Synthesis API in the official documentation provided by the World Wide Web Consortium (W3C). There are also many online tutorials and articles that provide detailed explanations and examples of how to use the API.

