Ambiguity and sharing in natural language analysis
Ambiguity and sharing
Given the generality of your question, I will try to answer at a matching level of generality.
The concept of ambiguity arises as soon as you consider a map or function f: A -> B that is not injective.
An injective function (also called a one-to-one function) is such that if a ≠ a' then f(a) ≠ f(a'). Given the function f, you are often interested in inverting it: given an element b of the codomain B of f, you want to know which element a of the domain A is such that f(a) = b. Note that such an element may not exist if the function is not surjective (i.e., onto).
When a function is not injective, there may be several values of a in A such that f(a) = b. In other words, if you use values in B to actually represent values in A through the mapping f, you get an ambiguous representation b that does not uniquely determine the value of a.
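To make this concrete, here is a tiny Python sketch (the function and the set A are my own toy example, not anything from the question):

    # A toy non-injective function: distinct elements of A map to the same b.
    def f(a: int) -> int:
        return a * a          # f(-2) == f(2), so f is not injective

    A = range(-3, 4)

    # f'(b) = {a in A | f(a) = b}: the set of all candidate interpretations
    # of b. More than one element means b is an ambiguous representation.
    def f_prime(b: int) -> set:
        return {a for a in A if f(a) == b}

    print(f_prime(4))  # {2, -2}: two values of a behind the same b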
From this you can see that the concept of ambiguity is so general that it is unlikely there is a single body of knowledge about it, even when restricted to computer science and programming.
However, if you want to invert the function that creates such ambiguities, for example to compute the set f'(b) = {a ∈ A | f(a) = b}, or the best element in this set according to some optimality criterion, there are techniques that can help in situations where the problem decomposes into subproblems that are often repeated with the same arguments. Then, if you record the result for each combination of arguments encountered, you never compute the same thing twice (the subproblem is said to be memoized). Note that ambiguity may exist for subproblems too, so some subproblem instances may have several answers, or an optimal answer among several.
This amounts to sharing a single copy of a subproblem among all the situations that require solving it with that set of arguments. The whole technique is called dynamic programming, and the difficulty is often to find the right decomposition into subproblems. Dynamic programming is primarily a way of sharing the repeated subcomputations of a solution so as to reduce complexity. However, if each subcomputation builds a fragment of a structure that is used recursively in larger structures, so that the answer is a structured object (a graph, for example), then sharing the computation steps can also result in sharing the corresponding substructure in all the places where it is needed. When many answers are found (due to ambiguity, for example), these answers can share sub-parts.
Rather than finding all the answers, dynamic programming can also be used to find only those that satisfy some optimality criterion. This requires that an optimal solution to the problem use optimal solutions to its subproblems.
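As a concrete illustration, here is a minimal dynamic programming sketch in Python (edit distance is my own choice of example, not something from the question): the problem decomposes into subproblems dist(i, j) that recur with the same arguments, memoization ensures each is solved only once, and the optimal answer is built from optimal answers to subproblems.

    from functools import lru_cache

    # A minimal sketch of dynamic programming by memoization.
    # The subproblem dist(i, j) = "edit distance between s[i:] and t[j:]"
    # recurs with the same arguments; lru_cache keeps a single shared copy
    # of each result, so nothing is computed twice.
    def edit_distance(s: str, t: str) -> int:
        @lru_cache(maxsize=None)
        def dist(i: int, j: int) -> int:
            if i == len(s):
                return len(t) - j          # insert the rest of t
            if j == len(t):
                return len(s) - i          # delete the rest of s
            if s[i] == t[j]:
                return dist(i + 1, j + 1)  # free match
            return 1 + min(dist(i + 1, j),      # delete s[i]
                           dist(i, j + 1),      # insert t[j]
                           dist(i + 1, j + 1))  # substitute
        return dist(0, 0)

    print(edit_distance("parser", "sparse"))  # 2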
Linguistic Processing Case
In the case of language processing, more specific things can be said. For this purpose, you need to define the domains you work with and the functions connecting these domains.
The purpose of language is to exchange information, concepts, or ideas that are in our brains, under the very rough assumption that our brains use the same functions to represent these ideas linguistically. I must also simplify the situation considerably (my apologies), since this is not the place for a complete theory of language, which would be disputed in any case, and I cannot take into account all varieties of syntactic theories.
So the linguistic exchange of information or ideas from person P to person Q goes as follows:
    idea in P ---f--> syntactic tree ---g--> lexical sequence ---h--> sound sequence
                                                                            |
                                                                            s
                                                                            |
                                                                            V
    idea in Q <--f'-- syntactic tree <--g'-- lexical sequence <--h'-- sound sequence
The first line, sentence generation, is performed by person P, and the second line, sentence analysis, is performed by person Q. The function s stands for speech transmission and should be the identity function. The functions f', g', and h' are supposed to invert the functions f, g, and h, which compute the successive representations down to the spoken utterance of the idea. But each of these functions can be (and usually is) non-injective, so ambiguities are introduced at every level, making it hard for Q to reconstruct the original idea from the sound sequence it receives (I deliberately use the word "sound" to avoid going into details). The same diagram holds, with some changes in detail, for written communication.
We ignore f and f' because they concern semantics, which is less formalized and for which I have no competence. Syntax trees are often defined by grammatical formalisms (I skip over important refinements here, such as attribute structures, but they can be taken into account).
Both the function g and the function h are usually non-injective, and are thus sources of ambiguity. There are other sources of ambiguity due to the various errors that can occur along the speech chain, but we ignore them for simplicity, as they do not greatly change the nature of the problems. Errors in sentence generation or transmission, or a mismatch between the speaker's and the listener's definition of the language, are an additional source of ambiguity, since the listener tries to correct potential errors without knowing whether any actually occurred.
We assume here that the listener makes no mistakes, and that he tries to "decode" the sentence as best he can according to his own linguistic standards and knowledge, including knowledge of error sources and statistics.
Lexical ambiguity
Given the sound sequence, the listening system has to invert the effect of the lexical generation function g by means of the function g'. The first problem is that several different words may produce the same sound sequence, which is a first source of ambiguity. The second problem is that the listening system actually receives the sequence corresponding to a whole string of words, and there may be no indication of where words begin or end. Thus there may be different ways of cutting the sound sequence into subsequences corresponding to recognizable words. This problem can get worse when noise creates further confusion between words.
As an example, consider the following holorime verses, taken from the Internet, which are pronounced more or less the same:
    Ms Stephen, without a first-rate stakeholder sum or deal,
    Must, even with outer fur straight, stay colder - some ordeal.
Sound sequence analysis can be performed with a non-deterministic finite-state automaton, interpreted in dynamic programming mode, which builds a directed acyclic graph whose nodes mark positions in the sound sequence and whose edges are labeled with recognized words. Any path through the graph, from its initial to its final node, corresponds to one possible way of analyzing the sound sequence as a sequence of words.
For the example above, this gives the following (much simplified) word lattice (oriented from left to right):
                 the -*- fun
                /          |
           Ms -*-- Stephen | without --*-- a   first -*- ...
          /               \|/               \ /
         *                 *                 *
          \               / \               / \
           must --*-- even   with -*-- outer   fur -*- ...
Thus the sound sequence also matches the following word sequences (among several others):
    Ms Stephen, with outer first-rate ...
    Must, even with outer first-rate ...
This makes the lexical analysis ambiguous.
Probabilities can be used to select the best word sequence. But it is also possible to preserve the ambiguity, keep all possible readings, and pass them as they are to the next stage of sentence analysis.
Note that the word lattice can be seen as a finite-state automaton that generates or recognizes all the possible lexical readings of the sound sequence as a sequence of words.
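Here is a small Python sketch of this lexical stage (the vocabulary and the unsegmented input are my own toy example, standing in for the sound sequence): a scan over positions builds the lattice as a directed acyclic graph, and every path through it is one of the ambiguous readings.

    # Build a word lattice over an unsegmented input: nodes are positions,
    # edges are vocabulary words spanning them.
    VOCAB = {"a", "an", "ice", "nice", "cream"}
    MAX_WORD = max(len(w) for w in VOCAB)

    def word_lattice(text):
        lattice = {i: [] for i in range(len(text) + 1)}
        for i in range(len(text)):
            for j in range(i + 1, min(i + MAX_WORD, len(text)) + 1):
                if text[i:j] in VOCAB:
                    lattice[i].append((text[i:j], j))
        return lattice

    def readings(lattice, i, n):
        """Enumerate every path from position i to n: one reading each."""
        if i == n:
            yield []
        for word, j in lattice[i]:
            for rest in readings(lattice, j, n):
                yield [word] + rest

    text = "anicecream"
    for r in readings(word_lattice(text), 0, len(text)):
        print(" ".join(r))
    # a nice cream
    # an ice cream

Enumerating the paths as above can blow up combinatorially; the point of the lattice is precisely that it can be kept as a shared structure and handed to the next stage without enumeration.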
Syntactic ambiguity
Syntactic structure is often based on a context-free (CF) skeleton grammar. The ambiguity of context-free languages is well known and well analyzed. A number of general CF parsers have been developed that can parse ambiguous sentences and produce a structure (which varies somewhat between algorithms) from which all parses can be extracted. Such a structure has become known as a parse forest, or shared parse forest.
It is known that this structure can be at worst cubic in the length of the analyzed sentence, provided the language grammar is binarized, that is, has no more than 2 non-terminals in each rule's right-hand side (or, more simply, no more than 2 symbols in each rule's right-hand side).
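As an illustration, here is a minimal CYK-style recognizer in Python (the toy grammar in Chomsky normal form and the sentence are mine; real shared-forest parsers are considerably more refined). The chart has O(n^2) cells and each cell is filled by trying O(n) split points, which is where the cubic worst case comes from; recording the successful splits as backpointers turns the chart into a shared parse forest.

    from collections import defaultdict

    # Toy CNF grammar: S -> NP VP, NP -> D N, VP -> V NP, plus lexical rules.
    BINARY = {("NP", "VP"): {"S"}, ("D", "N"): {"NP"}, ("V", "NP"): {"VP"}}
    LEXICAL = {"the": {"D"}, "dog": {"N"}, "saw": {"V"}, "cat": {"N"}}

    def cyk(words):
        n = len(words)
        chart = defaultdict(set)    # (i, j) -> non-terminals spanning words[i:j]
        forest = defaultdict(list)  # (A, i, j) -> list of (B, C, k) backpointers
        for i, w in enumerate(words):
            chart[i, i + 1] |= LEXICAL.get(w, set())
        for span in range(2, n + 1):                 # O(n) span lengths
            for i in range(n - span + 1):            # O(n) start positions
                j = i + span
                for k in range(i + 1, j):            # O(n) split points
                    for B in chart[i, k]:
                        for C in chart[k, j]:
                            for A in BINARY.get((B, C), set()):
                                chart[i, j].add(A)
                                forest[A, i, j].append((B, C, k))
        return chart, forest

    chart, forest = cyk("the dog saw the cat".split())
    print("S" in chart[0, 5])  # True: the sentence is recognized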
Actually, all these general CF parsing algorithms are more or less sophisticated variations around a simple concept: the intersection of the language L(A) of a finite-state automaton A with the language L(G) of a CF grammar G. The construction of this intersection goes back to the early work on context-free languages (Bar-Hillel, Perles and Shamir, 1961) and was intended to prove a closure property. It took about thirty years before people realized, in a 1995 paper, that it was also a very general parsing algorithm.
This classical cross-product construction produces a CF grammar for the intersection of the two languages L(A) and L(G). If you consider a sentence w to be parsed, given as a sequence of lexical elements, it can itself be seen as a finite-state automaton W that generates only the sentence w. For instance:
         this       is       a     finite       state      automaton
    (1)--------(2)------(3)-----(4)--------(5)---------(6)-----------((7))
is a finite-state automaton W that accepts only the sentence w = "this is a finite state automaton". So L(W) = {w}.
If G is the grammar of the language, then the intersection construction gives a grammar G_w for the language L(G_w) = L(W) ∩ L(G).
If the sentence w does not belong to L(G), then L(G_w) is empty and the sentence is not recognized. Otherwise, L(G_w) = {w}. Furthermore, it is then easy to prove that the grammar G_w generates the sentence w with exactly the same parse trees (hence the same ambiguity) as the grammar G, up to a simple renaming of non-terminals.
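Here is a sketch of that construction in Python, under simplifying assumptions (CNF grammar, naive enumeration of all state triples; the formulation is mine, and a real implementation would only generate reachable and productive non-terminals). A non-terminal of G_w is a triple (p, A, q) meaning "A derives a string that takes the automaton from state p to state q"; the three nested loops over states echo the cubic bound mentioned above.

    # BINARY and LEXICAL are the toy CNF grammar from the CYK sketch above.
    BINARY = {("NP", "VP"): {"S"}, ("D", "N"): {"NP"}, ("V", "NP"): {"VP"}}
    LEXICAL = {"the": {"D"}, "dog": {"N"}, "saw": {"V"}, "cat": {"N"}}

    def intersect(binary, lexical, start, transitions, initial, finals):
        """Build G_w from a CNF grammar and an automaton given as a set of
        transitions (p, word, q), an initial state, and final states."""
        states = set()
        for (p, _, q) in transitions:
            states |= {p, q}
        rules = []
        # Lexical rules: (p, A, q) -> word  when A -> word and p --word--> q.
        for (p, word, q) in transitions:
            for A in lexical.get(word, ()):
                rules.append(((p, A, q), [word]))
        # Binary rules: (p, A, r) -> (p, B, q) (q, C, r)  for every A -> B C.
        for (B, C), heads in binary.items():
            for A in heads:
                for p in states:
                    for q in states:
                        for r in states:
                            rules.append(((p, A, r), [(p, B, q), (q, C, r)]))
        # One start non-terminal of G_w per final state.
        starts = [(initial, start, f) for f in finals]
        return starts, rules

    # The sentence w as a linear automaton W with states 0..n, L(W) = {w}.
    w = "the dog saw the cat".split()
    transitions = {(i, word, i + 1) for i, word in enumerate(w)}
    starts, rules = intersect(BINARY, LEXICAL, "S", transitions, 0, {len(w)})
    # After pruning useless triples, the derivations from (0, "S", 5) are
    # exactly the parse trees of w: the rules form the shared parse forest.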
The grammar G_w is the (shared) parse forest for w, and the set of parse trees of w is exactly the set of derivations with this grammar. This gives a very simple way of organizing the concepts and explaining the structure of shared forests and general CF parsers.
But it gives more than that, since it shows how to generalize to different grammars and to different structures being parsed.
Constructive closure of intersection with regular sets through a cross-product construction is common to many grammatical formalisms that extend CF grammars some way into the context-sensitive realm. This includes tree-adjoining grammars and linear context-free rewriting systems. It is therefore a way of building general parsers for these more powerful formalisms, parsers that can deal with ambiguity and produce shared parse forests, which are just specialized grammars of the same type.
Another generalization: when lexical ambiguity leaves many candidate sentences, represented after lexical analysis by a shared word lattice, this word lattice can be read as a finite-state automaton recognizing all these sentences. Then the intersection construction eliminates all the sentences that are not in the language (the non-grammatical ones), and produces a CF grammar that is a shared forest for all possible parses of all the valid (grammatical) sentences in the word lattice.
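Continuing the same sketch (still my toy example): nothing in the intersect function above assumed that the automaton was a single chain, so it can be fed a word lattice unchanged.

    # The same construction, fed a word lattice instead of a single sentence.
    # "dock" is an unknown candidate reading: it produces no lexical rule,
    # so that path is eliminated, and the resulting grammar is a shared
    # forest for all the grammatical readings at once.
    lattice = {
        (0, "the", 1),
        (1, "dog", 2), (1, "dock", 2),   # two homophone-like candidates
        (2, "saw", 3),
        (3, "the", 4),
        (4, "cat", 5),
    }
    starts, rules = intersect(BINARY, LEXICAL, "S", lattice, 0, {5})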
As requested in the question, all possible ambiguous readings are thus preserved, as long as they are compatible with the available linguistic or utterance information.
The handling of noise and ill-formed sentences is usually also modeled with finite-state devices, so the same methods can be applied to them.
There are actually many other issues. For example, there are many ways of producing a shared forest, with more or less sharing. The techniques used to precompile the automata employed in general CF parsing affect the quality of sharing. Being too smart is not always very smart.
See also other answers I gave on SE on this topic: