Linguists are back! Start learning from “pronunciation”: this time the AI model has to teach itself
Getting computers to understand human language has long been one of the hardest problems in artificial intelligence.
Early natural language processing models usually relied on hand-designed features, requiring specialized linguists to write patterns manually. The results were disappointing, however, and AI research even fell into a winter for a time.
Every time I fire a linguist, the performance of the speech recognizer goes up.
——Frederick Jelinek
With the arrival of statistical models and large-scale pre-trained models, hand-crafted feature extraction is no longer necessary, but data annotation for specific tasks is still required. The most critical problem remains: the trained model still does not understand human language.
So, should we go back to the original form of language and ask again: how do humans acquire the ability to use language?
Researchers from Cornell University, MIT and McGill University recently published a paper in Nature Communications proposing an algorithmic model-synthesis framework. It starts from the most basic part of human language, morpho-phonology, and teaches AI to learn language by constructing the morphology of a language directly from sounds.
Paper link: https://www.nature.com/articles/s41467-022-32012-w
Morphophonology is a branch of linguistics that focuses on the sound changes that occur when morphemes (the smallest units of meaning) are combined into words, and tries to provide a set of rules that predict the regular sound changes of phonemes in a language.
For example, the English plural morpheme is written -s or -es but has three pronunciations: [s], [z] and [əz]. Thus cats is pronounced /kæts/, dogs is pronounced /dagz/, and horses is pronounced /hɔrsəz/.
When humans learn the plural pronunciations, they first recognize, on the basis of morphology, that the plural suffix is underlyingly /z/; then, on the basis of phonology, the suffix is converted to /s/ or /əz/ depending on the sounds in the stem, for example after a voiceless consonant.
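As a minimal sketch of this two-step account, the snippet below attaches an underlying /z/ and then picks the surface form from the last sound of the stem. The sound classes `SIBILANTS` and `VOICELESS` and the function `pluralize` are illustrative assumptions with a deliberately small inventory; they are not part of the paper's model.

```python
# Simplified sound classes for illustration; real English needs a fuller inventory.
SIBILANTS = {"s", "z", "ʃ", "ʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}


def pluralize(stem: list[str]) -> list[str]:
    """Attach the underlying plural /z/ and choose its surface pronunciation."""
    last = stem[-1]
    if last in SIBILANTS:
        return stem + ["ə", "z"]  # a schwa separates the stem-final sibilant from /z/
    if last in VOICELESS:
        return stem + ["s"]       # /z/ devoices after a voiceless consonant
    return stem + ["z"]           # elsewhere the underlying /z/ surfaces unchanged


print("".join(pluralize(["k", "æ", "t"])))       # kæts
print("".join(pluralize(["d", "a", "g"])))       # dagz
print("".join(pluralize(["h", "ɔ", "r", "s"])))  # hɔrsəz
```

The sibilant check comes first because /s/ is itself voiceless; testing voicelessness first would wrongly produce /hɔrss/.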
Other languages have morphophonological rules of the same kind. The researchers collected 70 data sets from phonology textbooks covering 58 languages; each data set contains only dozens to hundreds of words and only a few grammatical phenomena. The experiments showed that this method of finding grammatical structure in natural language can also simulate the process by which infants learn language.
By performing hierarchical Bayesian inference over these data sets, the researchers found that the model can acquire new morphophonological rules from just one or a few examples, and can extract common cross-language patterns and express them in a compact, human-understandable form.
Human intelligence is reflected above all in the ability to build theories of the world. For example, once natural languages had formed, linguists distilled sets of rules that help children learn a specific language more quickly. Current AI models, however, cannot summarize such rules into a theoretical framework that others can understand.
Before building a model, we need to solve a core problem: "How to describe a word." For example, the learning process of a word includes understanding the concept, intention, usage, pronunciation, and meaning of the word.
When building the lexicon, the researchers represented each word as a ⟨pronunciation, meaning⟩ pair. For example, open is represented as ⟨/opεn/, [stem: OPEN]⟩, the past tense as ⟨/d/, [tense: PAST]⟩, and the combined form opened as ⟨/opεnd/, [stem: OPEN, tense: PAST]⟩.
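One way to picture such pairs is as a small record type. This is only a sketch: the class name `Morpheme`, its fields, and the `combine` helper are hypothetical, and plain concatenation stands in for the phonological rules that a real grammar would apply afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    pronunciation: str  # phonemic form, written without the surrounding slashes
    meaning: tuple      # sorted (feature, value) pairs, e.g. (("stem", "OPEN"),)


def combine(stem: Morpheme, suffix: Morpheme) -> Morpheme:
    """Concatenate the sounds and merge the meaning features of two morphemes."""
    merged = tuple(sorted(set(stem.meaning) | set(suffix.meaning)))
    return Morpheme(stem.pronunciation + suffix.pronunciation, merged)


OPEN = Morpheme("opεn", (("stem", "OPEN"),))
PAST = Morpheme("d", (("tense", "PAST"),))

opened = combine(OPEN, PAST)
print(opened.pronunciation)  # opεnd
print(opened.meaning)        # (('stem', 'OPEN'), ('tense', 'PAST'))
```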
Given the data set, the researchers built a model that uses maximum a posteriori (MAP) inference to find the grammar rules that explain the word changes across a set of such pairs.
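In generic terms, MAP inference picks the candidate with the highest posterior probability given the observations. The notation below is assumed for illustration only (G for a candidate grammar, X for the set of observed pairs) and is not the paper's exact objective:

```latex
G^{*} \;=\; \arg\max_{G} P(G \mid X) \;=\; \arg\max_{G} P(X \mid G)\, P(G)
```

The second equality follows from Bayes' rule, because the normalizing term P(X) does not depend on G. Under this reading, a prior P(G) that prefers small grammars trades off against a likelihood P(X | G) that rewards reproducing the observed words.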
To represent sounds, phonemes (atomic sounds) are encoded as vectors of binary features; for example, /m/ and /n/ are both nasal. Phonological rules are then defined over this feature space.
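A toy version of this encoding might look as follows. The four features and six phonemes are a hypothetical mini-inventory chosen for illustration, and `natural_class` is an assumed helper name; the paper's feature system is much richer.

```python
FEATURES = ("nasal", "voice", "sonorant", "labial")

# Hypothetical mini-inventory: 1 means the feature is present (+), 0 absent (-).
PHONEMES = {
    "m": (1, 1, 1, 1),
    "n": (1, 1, 1, 0),
    "b": (0, 1, 0, 1),
    "d": (0, 1, 0, 0),
    "p": (0, 0, 0, 1),
    "t": (0, 0, 0, 0),
}


def natural_class(**spec):
    """Return every phoneme whose vector agrees with all requested feature values."""
    return {
        symbol
        for symbol, vector in PHONEMES.items()
        if all(vector[FEATURES.index(name)] == value for name, value in spec.items())
    }


print(sorted(natural_class(nasal=1)))              # ['m', 'n']
print(sorted(natural_class(voice=0, sonorant=0)))  # ['p', 't']
```

A partial feature specification therefore names a set of phonemes, which is what lets a single rule apply to a whole class of sounds.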
The researchers use a classic rule formalism, context-dependent rewrite rules, sometimes called SPE-style rules after their extensive use in The Sound Pattern of English.
Each rule is written as (focus)→(structural_change)/(left_trigger)_(right_trigger), which means that whenever the left and right trigger environments occur immediately to the left and right of the focus, the focus phoneme is transformed according to the structural change.
A trigger environment specifies a conjunction of features, which characterizes a set of phonemes. For example, in English, a [-sonorant] phoneme at the end of a word becomes [-voice] whenever the phoneme to its left is a voiceless obstruent, [-voice -sonorant]; in particular, word-final /d/ changes to /t/. The rule is written [-sonorant] → [-voice]/[-voice -sonorant]_#. Applying this rule to walked, for example, changes the pronunciation from /wɔkd/ to /wɔkt/.
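The devoicing rule can be sketched in a few lines. Everything here is an assumption made for illustration: the tiny feature table, the extra `place` attribute used to find the voiceless counterpart of a sound, and the function names. Because the right trigger is the word boundary #, only the final phoneme needs to be checked.

```python
# Hypothetical feature table; "place" keeps /d/ and /g/ distinct after devoicing.
PHONEMES = {
    "w": {"sonorant": True,  "voice": True,  "place": "labiovelar"},
    "ɔ": {"sonorant": True,  "voice": True,  "place": "vowel"},
    "k": {"sonorant": False, "voice": False, "place": "velar"},
    "g": {"sonorant": False, "voice": True,  "place": "velar"},
    "t": {"sonorant": False, "voice": False, "place": "alveolar"},
    "d": {"sonorant": False, "voice": True,  "place": "alveolar"},
}


def matches(symbol, spec):
    """True if the phoneme carries every feature value listed in spec."""
    return all(PHONEMES[symbol][name] == value for name, value in spec.items())


def change(symbol, update):
    """Return the phoneme that equals `symbol` except for the updated features."""
    target = {**PHONEMES[symbol], **update}
    for other, features in PHONEMES.items():
        if features == target:
            return other
    raise ValueError(f"no phoneme in the table has features {target}")


def devoice_final(word):
    """Apply [-sonorant] -> [-voice] / [-voice -sonorant] _ # to a phoneme string."""
    if len(word) < 2:
        return word
    left, focus = word[-2], word[-1]
    if matches(focus, {"sonorant": False}) and matches(left, {"voice": False, "sonorant": False}):
        return word[:-1] + change(focus, {"voice": False})
    return word


print(devoice_final("wɔkd"))  # wɔkt
print(devoice_final("dɔgd"))  # dɔgd (the left neighbour /g/ is voiced, so nothing changes)
```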
When such rules are restricted from applying cyclically to their own output, the rules and lexica correspond to 2-way rational functions, which in turn correspond to finite-state transductions. It has been argued that the space of finite-state transductions is expressive enough to cover known empirical phenomena in morphophonology and represents a limit on the descriptive power actually used by phonological theories.
To learn such grammars, the researchers used Bayesian Program Learning (BPL). Each grammar rule T is modeled as a program in a programming language that captures the domain-specific constraints of the problem space. The linguistic structure common to all languages is called universal grammar. This approach can be seen as a modern instance of a long-standing approach in linguistics, one that uses human-understandable generative representations to formalize universal grammar.
Defining the problem that BPL must solve gives no guidance on how to solve it: the search space of all programs is infinite, and it lacks the local smoothness that local optimization algorithms such as gradient descent or Markov chain Monte Carlo exploit. The researchers therefore adopted a constraint-based program synthesis strategy, which turns the optimization problem into a combinatorial constraint satisfaction problem that is solved with a Boolean satisfiability (SAT) solver.
These solvers implement an exhaustive but relatively efficient search and guarantee that, given enough time, an optimal solution will be found. The smallest grammar consistent with some data can be found with the Sketch program synthesizer, subject to an upper bound on the grammar size.
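The idea of "smallest grammar consistent with the data, up to a size bound" can be illustrated without a SAT solver. The sketch below swaps in plain brute-force enumeration for Sketch and SAT, and it searches a deliberately tiny hypothesis space: a single rule that devoices word-final /d/ when the preceding phoneme matches a feature specification. The feature table, the three data points, and all names are assumptions for illustration.

```python
from itertools import combinations, product

FEATURES = ("voice", "sonorant")

# Hypothetical features for the phonemes that precede the final /d/ in the data.
PHONEMES = {
    "k": {"voice": False, "sonorant": False},
    "g": {"voice": True,  "sonorant": False},
    "n": {"voice": True,  "sonorant": True},
}

# (underlying form, observed surface form)
DATA = [("wɔkd", "wɔkt"), ("hʌgd", "hʌgd"), ("fænd", "fænd")]


def apply(spec, word):
    """Devoice a word-final /d/ if the preceding phoneme satisfies spec."""
    if word[-1] == "d" and all(PHONEMES[word[-2]][f] == v for f, v in spec.items()):
        return word[:-1] + "t"
    return word


def smallest_rule(data, max_size):
    """Enumerate left-context specs by increasing size; return the first consistent one."""
    for size in range(max_size + 1):
        for names in combinations(FEATURES, size):
            for values in product((False, True), repeat=size):
                spec = dict(zip(names, values))
                if all(apply(spec, underlying) == surface for underlying, surface in data):
                    return spec
    return None  # nothing within the size bound explains the data


print(smallest_rule(DATA, max_size=2))  # {'voice': False}
print(smallest_rule(DATA, max_size=0))  # None
```

Searching sizes in increasing order is what makes the first hit a smallest solution; the `max_size` argument plays the role of the upper bound on grammar size.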
In practice, however, the exhaustive search techniques used by SAT solvers cannot scale to the large number of rules needed to explain large corpora.
To scale the solver to large and complex theories, the researchers took inspiration from a fundamental feature of children acquiring language and scientists building theories.
Children do not learn language overnight, but gradually enrich their grasp of grammar and vocabulary through intermediate stages of language development. Likewise, a complex scientific theory may begin with a simple conceptual core and then gradually develop to encompass an increasing number of linguistic phenomena.
Based on these ideas, the researchers designed a program synthesis algorithm that starts from a small program and then repeatedly uses the SAT solver to find small modifications that let it explain more and more data. Specifically, it finds a counterexample to the current theory and then uses the solver to exhaustively explore the space of all small modifications to the theory that can accommodate this counterexample.
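The loop described above can be sketched as follows. This is not the authors' implementation: exhaustive enumeration of one-step edits stands in for the SAT solver, a "theory" is just an ordered tuple of word-final rewrite rules, and the feature table, candidate pool and data are hypothetical.

```python
# Hypothetical features for the phonemes that appear before the final sound.
PHONEMES = {
    "k": {"voice": False, "sonorant": False},
    "t": {"voice": False, "sonorant": False},
    "g": {"voice": True,  "sonorant": False},
}

# (underlying form, observed surface form)
DATA = [("wɔkd", "wɔkt"), ("hʌgd", "hʌgd"), ("kætz", "kæts"), ("dagz", "dagz")]

# Candidate rules: (focus, replacement, left-context spec of at most one feature).
SPECS = [frozenset()] + [
    frozenset({(f, v)}) for f in ("voice", "sonorant") for v in (False, True)
]
POOL = [(a, b, spec) for a, b in (("d", "t"), ("z", "s")) for spec in SPECS]


def apply(theory, word):
    """Apply each word-final rule in order."""
    for focus, replacement, spec in theory:
        if word[-1] == focus and all(PHONEMES[word[-2]][f] == v for f, v in spec):
            word = word[:-1] + replacement
    return word


def neighbours(theory):
    """All theories one small edit away: drop one rule or append one rule."""
    for i in range(len(theory)):
        yield theory[:i] + theory[i + 1:]
    for rule in POOL:
        yield theory + (rule,)


def synthesize(data):
    theory = ()  # start from the empty theory
    while True:
        explained = [(u, s) for u, s in data if apply(theory, u) == s]
        failing = [(u, s) for u, s in data if apply(theory, u) != s]
        if not failing:
            return theory
        target = explained + failing[:1]  # keep old coverage, add one counterexample
        for candidate in neighbours(theory):
            if all(apply(candidate, u) == s for u, s in target):
                theory = candidate
                break
        else:
            raise ValueError("no small modification explains the counterexample")


theory = synthesize(DATA)
for underlying, _ in DATA:
    print(underlying, "->", apply(theory, underlying))
```

On this toy data the loop first adds a rule that devoices /d/ after a voiceless sound, then a second rule that does the same for /z/. Each step only searches edits near the current theory, which keeps every individual search small.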
This heuristic method, however, lacks the completeness guarantee of SAT solving: although it repeatedly calls a complete and exact SAT solver, it is not guaranteed to find an optimal solution. Each of these repeated calls is nevertheless much more tractable than optimizing over the entire data set directly. Constraining each new theory to stay close to its predecessor in theory space yields polynomially smaller constraint satisfaction problems and therefore exponentially faster search, because in the worst case SAT solvers scale exponentially with problem size.
In the experimental evaluation, the researchers collected 70 problems from linguistics textbooks, each of which requires synthesizing a theory of some forms in a natural language. The problems vary in difficulty and cover a wide variety of natural language phenomena.
Natural languages are also diverse. They include tonal languages: for example, in Kerewe (a Bantu language of Tanzania), "to count" is /kubala/ but "to count it" is /kukíbála/, where the accents mark high tones.
There are also languages with vowel harmony. For example, Turkish has /el/ and /tʃan/, meaning hand and bell respectively, and /el-ler/ and /tʃan-lar/ for the plurals hands and bells. There are many other linguistic phenomena as well, such as assimilation and epenthesis.
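The Turkish pattern can be sketched as a rule that copies the backness of the last stem vowel onto the suffix vowel. The vowel sets below use Turkish orthographic vowels, and the function name `plural` is an assumption; this is a simplified illustration, not the grammar the model learned.

```python
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")


def plural(stem: str) -> str:
    """Pick -ler or -lar so the suffix vowel agrees in backness with the last stem vowel."""
    for sound in reversed(stem):
        if sound in FRONT_VOWELS:
            return stem + "-ler"
        if sound in BACK_VOWELS:
            return stem + "-lar"
    raise ValueError(f"no vowel found in stem {stem!r}")


print(plural("el"))    # el-ler
print(plural("tʃan"))  # tʃan-lar
```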
In the evaluation, the researchers first measured the model's ability to discover the correct lexicon. Compared with ground-truth lexica, the model found grammars that correctly matched the entire lexicon in 60% of the benchmark problems and correctly explained most of the lexicon in 79% of the problems.
Typically, the correct lexicon for a problem is less ambiguous than the correct rules, and any rules that generate the complete data from the correct lexicon must be observationally equivalent to any ground-truth rules one might posit. Agreement with the ground-truth lexicon should therefore serve as a proxy for whether the synthesized rules behave correctly on the data, and it should correlate with the quality of the rules.
To test this hypothesis, the researchers randomly sampled 15 problems and worked with a professional linguist to grade the discovered rules. Both recall (the proportion of actual phonological rules that were correctly recovered) and precision (the proportion of recovered rules that actually occur) were measured. Under both metrics, rule accuracy is positively correlated with lexicon accuracy.
When the system gets the entire lexicon correct, it rarely introduces extraneous rules (high precision) and almost always recovers all the correct rules (high recall).
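The two metrics, as defined above, reduce to simple set arithmetic. The rule names in this sketch are made up for illustration, and treating rules as exactly matching labels is a simplifying assumption; in the study a linguist graded the rules by hand.

```python
def precision_recall(recovered: set, actual: set) -> tuple:
    """Precision: share of recovered rules that are real. Recall: share of real rules recovered."""
    correct = recovered & actual
    precision = len(correct) / len(recovered) if recovered else 0.0
    recall = len(correct) / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical rule labels.
recovered = {"final-devoicing", "schwa-insertion", "spurious-rule"}
actual = {"final-devoicing", "schwa-insertion", "vowel-harmony", "nasal-assimilation"}

precision, recall = precision_recall(recovered, actual)
print(round(precision, 2))  # 0.67
print(recall)               # 0.5
```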