Application of XML in speech synthesis
The Internet and everything related to it seem to be everywhere now. You may have received a voice call from a late-night telemarketer or a prescription notification from your local pharmacy. A newer technology now combines speech synthesis with XML to deliver voice information.
Conveying information by voice is nothing new; it is a method of communication we have used for thousands of years. Receiving phone calls from a computer is nothing new either. Many telephony technologies are now in common use, from fax machines and autodialers to interactive voice response (IVR) systems. The telephone is, of course, the most common channel for them.
Traditional speech systems use pre-recorded samples, dictionaries and phonemes to create the sounds we hear. However, this pre-recorded approach has many problems. One of the most common is a lack of consistency and variety. If there is only one recorded version of the speech, with a single sample of each word or sound, it is difficult to get a computer to produce a question with an intonation different from that of an ordinary declarative sentence. Equally difficult is getting a computer to know when to use a particular intonation, and which one.
To help solve these speech synthesis problems, the W3C has published a working draft of the Speech Synthesis Markup Language (SSML). This XML vocabulary lets voice browser developers control how a speech synthesizer renders text. For example, developers can embed volume instructions in the markup and have the synthesizer apply them when it generates speech.
The SSML specification is based on earlier work by Sun called JSpeech Markup Language (JSML), which in turn grew out of the Java Speech API Markup Language. SSML is now a working draft of the W3C Voice Browser Working Group.
SSML is aimed primarily at text-to-speech (TTS) processors. A TTS engine takes a body of text and converts it to speech. Several kinds of TTS applications already exist, such as telephone voice response systems and more advanced systems designed for blind users. The inherent ambiguity in how a given text should be pronounced is one of the main difficulties existing TTS systems face. Other common problems involve abbreviations and acronyms (such as HTML) and words whose pronunciation does not follow from their spelling (such as subpoena).
The basic SSML elements describe the structure of the text. Like HTML, SSML provides a paragraph element, but it goes further and also provides a sentence element. When the markup states where each paragraph and sentence begins and ends, the TTS engine can generate speech more accurately.
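As a rough illustration of this structural markup, the sketch below embeds a small SSML document in Python and parses it with the standard library to list its sentences. The element names p and s and the namespace URI follow the published SSML 1.0 Recommendation, not necessarily the draft discussed here, and the sample sentences are invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace used by SSML 1.0 documents (assumption: the 1.0 Recommendation).
NS = {"ssml": "http://www.w3.org/2001/10/synthesis"}

# Illustrative document: <p> marks a paragraph, <s> marks a sentence.
SSML_DOC = """\
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <p>
    <s>Your prescription is ready.</s>
    <s>Would you like it delivered?</s>
  </p>
</speak>
"""

root = ET.fromstring(SSML_DOC)  # raises ParseError if the markup is malformed
sentences = [s.text for s in root.findall("ssml:p/ssml:s", NS)]
print(sentences)
```

Because the sentence boundaries are explicit, an engine does not have to guess them from punctuation.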
Beyond basic structure, SSML also lets us specify how a given word or group of words should be spoken. This is done with the "say-as" element, one of the most useful parts of SSML. It names a template that describes how the enclosed text is to be pronounced. With "say-as" we can say how an abbreviation should be read, and we can give the pronunciation of words that are not spoken the way they are spelled. We can also distinguish numbers from dates. The "say-as" element includes support for email addresses, currencies, phone numbers, and so on.
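A minimal sketch of "say-as" in use might look like the following. The attribute name interpret-as comes from the SSML 1.0 Recommendation (earlier drafts used a different attribute). The values "characters", "telephone" and "date" are assumptions: many engines accept them, but the core specification does not fix the list. The phone number and date are made up.

```python
import xml.etree.ElementTree as ET

NS = {"ssml": "http://www.w3.org/2001/10/synthesis"}

# Each <say-as> tells the engine how to interpret the enclosed text.
# The interpret-as values are assumed, engine-dependent hints.
SSML_DOC = """\
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  The page is written in <say-as interpret-as="characters">HTML</say-as>.
  Call <say-as interpret-as="telephone">555-0100</say-as> before
  <say-as interpret-as="date" format="mdy">12/01/2024</say-as>.
</speak>
"""

root = ET.fromstring(SSML_DOC)
# Map each marked-up text fragment to the interpretation hint attached to it.
hints = {el.text: el.get("interpret-as") for el in root.findall("ssml:say-as", NS)}
print(hints)
```

Without the hints, an engine could reasonably read "12/01/2024" as a fraction or "HTML" as a single word.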
We can also supply a phonetic transcription for a piece of text with the "phoneme" element. For example, we can use it to mark the difference in pronunciation of the word potato between American English and British English.
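The potato example could be written roughly as below, using the phoneme element with its alphabet and ph attributes as defined in SSML 1.0. The IPA transcriptions are approximate and are given only for illustration; a real engine may expect a different alphabet or notation.

```python
import xml.etree.ElementTree as ET

NS = {"ssml": "http://www.w3.org/2001/10/synthesis"}

# Same spelling, two pronunciations. The IPA strings are approximate
# transcriptions chosen for this example.
SSML_DOC = """\
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
  American: <phoneme alphabet="ipa" ph="pəˈteɪtoʊ">potato</phoneme>.
  British: <phoneme alphabet="ipa" ph="pəˈteɪtəʊ">potato</phoneme>.
</speak>
"""

root = ET.fromstring(SSML_DOC)
pronunciations = [el.get("ph") for el in root.findall("ssml:phoneme", NS)]
print(pronunciations)
```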
Several more advanced features of SSML help a TTS system produce more natural, human-sounding speech. We can use the "voice" element to request a male, female, or neutral voice, and we can also specify the age of the speaker. In this way we can ask for anything from a 4-year-old boy to a 75-year-old woman.
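The two speakers mentioned above might be requested as in this sketch. The gender and age attributes are taken from the SSML 1.0 definition of the voice element; whether an engine actually has a matching voice is up to the engine, and the dialogue is invented.

```python
import xml.etree.ElementTree as ET

NS = {"ssml": "http://www.w3.org/2001/10/synthesis"}

# Each <voice> requests a speaker by gender and approximate age.
SSML_DOC = """\
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice gender="male" age="4">I want a story!</voice>
  <voice gender="female" age="75">Then sit down and listen.</voice>
</speak>
"""

root = ET.fromstring(SSML_DOC)
speakers = [(v.get("gender"), int(v.get("age"))) for v in root.findall("ssml:voice", NS)]
print(speakers)
```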
We can use the "emphasis" element to wrap text that should be stressed or, conversely, de-emphasized. The "break" element tells the system where the speech should pause.
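A short sketch of both elements follows. The level values "strong" and "reduced" and the time attribute on break are taken from SSML 1.0; the half-second pause and the wording are arbitrary choices for the example.

```python
import xml.etree.ElementTree as ET

NS = {"ssml": "http://www.w3.org/2001/10/synthesis"}

# <emphasis> raises or lowers stress on a phrase; <break/> inserts a pause.
SSML_DOC = """\
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <s>Your flight <emphasis level="strong">has been cancelled</emphasis>.</s>
  <break time="500ms"/>
  <s>Please call the airline, <emphasis level="reduced">if convenient</emphasis>.</s>
</speak>
"""

root = ET.fromstring(SSML_DOC)
levels = [e.get("level") for e in root.findall(".//ssml:emphasis", NS)]
pause = root.find("ssml:break", NS).get("time")
print(levels, pause)
```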
One of the most advanced features of SSML is its "prosody" element, which controls how a given passage is spoken. We can specify the pitch, the pitch range, and the speaking rate (in words per minute). For finer control there is the "contour" attribute, which ties pitch targets to positions within the text. By setting a contour for a passage, we can define more precisely how the speech is generated.
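The sketch below shows prosody markup and splits a contour value into its individual targets. It assumes the SSML 1.0 syntax, in which contour is written as an attribute of prosody holding (position, pitch) pairs, and in which rate is usually given as a label such as "slow" or as a relative value instead of words per minute. The numbers are illustrative only.

```python
import re
import xml.etree.ElementTree as ET

NS = {"ssml": "http://www.w3.org/2001/10/synthesis"}

# First passage: coarse control through labels.
# Second passage: a pitch contour given as (position, pitch change) pairs.
SSML_DOC = """\
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <prosody rate="slow" pitch="low" volume="soft">Please listen carefully.</prosody>
  <prosody contour="(0%,+20Hz) (50%,+40Hz) (100%,+10Hz)">Are you still there?</prosody>
</speak>
"""

root = ET.fromstring(SSML_DOC)
shaped = root.find("ssml:prosody[@contour]", NS)

# Split the contour string into (position, pitch) tuples.
targets = re.findall(r"\(\s*([^,]+?)\s*,\s*([^)]+?)\s*\)", shaped.get("contour"))
print(targets)
```

Read this way, the contour asks for the pitch to rise toward the middle of the question and fall back at the end.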