Theory: "Lexical Coding"

I use the term "Lexical Coding" for lack of a better one.

The word, not the letter, is perhaps the basic unit of communication. Unicode tries to assign a numerical value to each letter of all known alphabets; what is a letter to one language is a symbol to another. Unicode 5.1 assigns more than 100,000 values to these characters. Of the approximately 180,000 words used in modern English, it is said that a vocabulary of about 2,000 words is enough to converse in general terms. A "Lexical Encoding" would encode each word rather than each letter, and encapsulate them within an expression.

// A simplified example of a "Lexical Encoding"
String sentence = "How are you today?";
int[] encoded = { 93, 22, 14, 330, QUERY };

In this example, each token in the String is encoded as an integer. Here the coding scheme simply assigns an int value based on a generalized statistical ranking of word usage, and assigns a constant to the question mark.
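A fuller sketch of such an encoder might look like the following; the word ranks and the QUERY sentinel are invented for illustration, not taken from any real frequency table:

```java
import java.util.*;

// A hypothetical lexical encoder. The ranks below are made up;
// a real scheme would derive them from usage statistics.
public class LexicalEncoder {
    static final int QUERY = -1; // sentinel code for '?'
    static final Map<String, Integer> RANKS = new HashMap<>();
    static {
        RANKS.put("how", 93);
        RANKS.put("are", 22);
        RANKS.put("you", 14);
        RANKS.put("today", 330);
    }

    static int[] encode(String sentence) {
        // naive tokenization: split off the question mark, lower-case the rest
        String[] tokens = sentence.toLowerCase().replace("?", " ?").split("\\s+");
        int[] codes = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            codes[i] = tokens[i].equals("?") ? QUERY : RANKS.getOrDefault(tokens[i], 0);
        }
        return codes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("How are you today?")));
        // prints [93, 22, 14, 330, -1]
    }
}
```

Unknown words fall back to 0 here; a real encoding would need an explicit out-of-vocabulary strategy.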

Ultimately, a word has both spelling and meaning. Any "Lexical Encoding" would retain the meaning and intent of the sentence as a whole, not the language it happens to be written in. An English sentence would be encoded into "... language-neutral atomic elements of meaning ...", which could then be reconstituted in any language with a structured syntactic form and grammatical structure.

What other examples of Lexical Coding methods are there?


If you were wondering where the word usage statistics came from:
http://www.wordcount.org

+4
8 answers

There are some serious issues with this idea. In most languages, both the meaning of a word and the word associated with a meaning change very quickly.

No sooner have you assigned a number to a word than the meaning of the word changes. For example, the word "gay" used to mean only "happy" or "merry", but it is now mainly used to refer to homosexuals. Another example is the morpheme "thank you", which apparently comes from the German "danke", which is just one word. A further example is "goodbye", which is a contraction of "God be with ye".

Another problem is that even if you take a snapshot of a word at any given time, the meaning and usage of that word will be disputed, even within the same province. When dictionaries are written, the academics responsible often argue over a single word.

In short, you could not do this with an existing language. You would have to consider inventing a language of your own for the purpose, or using a relatively static language that has already been invented, such as Interlingua or Esperanto. However, even these would not be ideal for the purpose of defining static morphemes in a standard lexicon.

Even in Chinese, where there is a rough mapping of character to meaning, it still would not work. Many characters change their meaning depending on both context and which characters precede or follow them.

The problem is at its worst when you try to translate between languages. One word in English may be usable in a variety of cases where another language cannot use it directly. An example of this is "free". In Spanish, either "libre", meaning "free" as in speech, or "gratis", meaning "free" as in beer, can be used (and using the wrong word in place of "free" would look very funny).

There are other words that are even harder to pin a meaning on, such as the word for beautiful in Korean: when calling a girl beautiful, there would be several candidate words, but when calling food beautiful, unless you mean the food is good-looking, there are several other candidates that are completely different.

What it comes down to is that although we use only about 200,000 words in English, our vocabulary is in some respects larger, because we assign many different meanings to the same word. The same problems apply to Esperanto and Interlingua, and to every other language that is meaningful for conversation. Human speech is not a well-defined, well-oiled machine. So, although you could create a lexicon in which each "word" has its own unique meaning, it would be very difficult, and close to impossible, for machines using current techniques to translate from any human language into your special standardized lexicon.

That is why machine translation still sucks, and will for a long time to come. If you can do better (and I hope you can), then you should probably consider doing it with some sort of scholarship and/or university/government funding, working towards a PhD; or simply make a ton of money, whatever keeps your ship steaming.

+2

This issue borders on linguistics more than programming, but for languages that are highly synthetic (having words that consist of several combined morphemes), it can be a very complex problem to try to "number" all possible words, as opposed to languages like English, which are at least somewhat isolating, or languages like Chinese, which are highly analytic.

That is, in some languages words cannot easily be broken up and counted based on their constituent glyphs.

This Wikipedia article on isolating languages may be helpful in explaining the problem.

+6

It's easy enough to invent one for yourself. Turn each word into a canonical byte stream (say, lower-cased, decomposed UCS-32), then hash it down to an integer. 32 bits would probably be enough, and if not, then 64 bits certainly would.
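A minimal sketch of that scheme, with CRC32 standing in for whatever stable 32-bit hash you prefer:

```java
import java.text.Normalizer;
import java.util.zip.CRC32;

// Canonicalize a word (lower case, Unicode-decomposed), then hash its
// code points, fed as UCS-4 byte sequences, down to a 32-bit value.
public class WordHash {
    static long hash(String word) {
        String canonical = Normalizer.normalize(word.toLowerCase(), Normalizer.Form.NFD);
        CRC32 crc = new CRC32();
        canonical.codePoints().forEach(cp -> {
            // feed each code point as four big-endian bytes (UCS-4 / UTF-32)
            crc.update((cp >>> 24) & 0xFF);
            crc.update((cp >>> 16) & 0xFF);
            crc.update((cp >>> 8) & 0xFF);
            crc.update(cp & 0xFF);
        });
        return crc.getValue();
    }

    public static void main(String[] args) {
        System.out.println(hash("Word") == hash("word")); // true: same canonical form
        System.out.println(hash("word") == hash("ward")); // false: distinct words
    }
}
```

Note that hashing discards the spelling: two words can collide, and the mapping cannot be reversed without a lookup table, which already makes it weaker than Unicode's agreed-upon identifiers.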

Before you decide to flame this answer, consider that the purpose of Unicode is simply to assign each glyph a unique identifier. Not to rank, sort, or group them, but just to map each one onto a unique identifier that everyone agrees on.

+3

How would such a system handle pluralization of nouns, or conjugation of verbs? Would each of these have its own "Unicode" value?

+2

As a translation scheme, this is probably not going to work without a lot more effort. You would like to think that you can assign a number to each word and then mechanically translate that into another language. In reality, languages have the problem of multiple words spelled the same way: "the wind blew her hair back" versus "wind your watch".

For transmitting text, where you would presumably have an alphabet per language, it would work fine, although I wonder what you would gain there compared with using a variable-length dictionary, as ZIP does.

+2

This is an interesting question, but I suspect you are asking it for the wrong reasons. Are you thinking of this "lexical Unicode" as something that would let you break down sentences into language-neutral atomic elements of meaning and then be able to reconstitute them in some other concrete language? As a means of achieving a universal translator, perhaps?

Even if you can encode and store, say, an English sentence using "lexical Unicode", you cannot expect to read it back and magically render it in, say, Chinese with the meaning kept intact.

Your Unicode analogy, however, is very useful.

Keep in mind that Unicode, although a "universal" code, does not embody the pronunciation, meaning, or usage of the character in question. Each code point refers to a specific glyph in a specific language (or rather, in a script used by a group of languages). It is elemental at the level of the visual representation of a glyph (within the bounds of style, formatting, and fonts). The Unicode code point for the Latin letter "A" is just that: the Latin letter "A". It cannot automatically be rendered as, say, the Arabic letter Alif (ا) or the Indic (Devanagari) letter "A" (अ).

Keeping to the Unicode analogy, your lexical Unicode would have code points for each word (word form) in each language. Unicode has ranges of code points for specific scripts; your lexical Unicode would have ranges of codes for each language. Different words in different languages, even if they have the same meaning (synonyms), would have to have different code points. The same word with different meanings, or different pronunciations (homonyms), would have to have different code points.

In Unicode, for some languages (but not all) where the same character has a different shape depending on its position in the word - e.g. in Hebrew and Arabic, the shape of a glyph changes at the end of the word - that shape has a distinct code point. Likewise in your lexical Unicode, if a word has a different form depending on its position in the sentence, that form may warrant its own code point.

Perhaps the easiest way to come up with code points for the English language would be to base your system on, say, a particular edition of the Oxford English Dictionary and assign a unique code to each word, sequentially. You would have to use a different code for each different meaning of the same word, and a different code for different forms - e.g. if the same word can be used as a noun and as a verb, you would need two codes.
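A toy illustration of that numbering scheme; the words and sense labels are invented for the example, not drawn from any real dictionary:

```java
import java.util.*;

// Assigns a sequential code to each (word, sense) pair, so that one
// spelling with two meanings ends up with two distinct codes.
public class SenseCodes {
    static final Map<String, Integer> codes = new LinkedHashMap<>();
    static int next = 1;

    static int assign(String word, String sense) {
        return codes.computeIfAbsent(word + "#" + sense, k -> next++);
    }

    public static void main(String[] args) {
        int freeGratis = assign("free", "at no cost");
        int freeLibre  = assign("free", "unrestricted");
        int bookNoun   = assign("book", "noun: bound pages");
        int bookVerb   = assign("book", "verb: to reserve");
        System.out.println(freeGratis != freeLibre); // true: one spelling, two senses
        System.out.println(bookNoun != bookVerb);    // true: noun and verb differ
    }
}
```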

You would then have to do the same for every other language you want to include, using the most authoritative dictionary for each.

Chances are, though, that this exercise is more effort than it is worth. If you decide to include all the world's living languages, plus some historical dead ones and a few fictional ones - as Unicode does - you will end up with a code space so large that your codes would have to be extremely wide to accommodate it. You will gain nothing in terms of compression: it is likely that a sentence represented as a String in the original language would take up less space than the same sentence represented as codes.

PS For those who say this is an impossible task because the meanings of words change, I do not see a problem there. To use the Unicode analogy, the usage of letters has changed (admittedly not as fast as the meanings of words), but it is of no concern to Unicode that "th" used to be written with "y" in the Middle Ages. Unicode has code points for 't', 'h', and 'y', and each serves its own purpose.

PPS Actually, it is of some concern to Unicode that "oe" can also be written "œ", or that "ss" can be written "ß" in German.

+2

This is an interesting little exercise, but I would urge you to consider it little more than an introduction to the distinction in natural language between types and tokens.

A type is a single representative of a word that stands for all of its instances. A token is a single count of each instance of the word. Let me explain with the following example:

"John went to the store; he bought the bread."

Here are the frequency counts for this example, where the counts mean the number of tokens:

John: 1
went: 1
to: 1
the: 2
store: 1
he: 1
bought: 1
bread: 1

Note that "the" is counted twice: there are two tokens of "the". However, while there are nine tokens in total, there are only eight of these word-to-count pairs. Words are thus reduced to types, each paired with its token count.
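The type/token distinction can be sketched as a simple frequency count; the tokenizer below is a naive whitespace split with punctuation stripped:

```java
import java.util.*;

// Counts tokens per type for a sentence. Each map key is a type;
// each value is that type's token count.
public class TypeToken {
    static Map<String, Integer> tokenCounts(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String tok : text.toLowerCase().replaceAll("[^a-z\\s]", "").split("\\s+")) {
            if (!tok.isEmpty()) counts.merge(tok, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = tokenCounts("John went to the store; he bought the bread.");
        int tokens = c.values().stream().mapToInt(Integer::intValue).sum();
        System.out.println("tokens=" + tokens + ", types=" + c.size());
        // prints tokens=9, types=8 ("the" contributes two tokens of one type)
    }
}
```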

Types and tokens are useful in statistical NLP. "Lexical Encoding", on the other hand, I would be wary of. It is a segue into much more old-fashioned approaches to NLP, heavy on preprogramming and rationalism. I do not even know of any statistical MT system that actually assigns a specific "address" to a word. There are too many relationships between words, for one thing, to build any well-thought-out numerical ontology, and if we are just throwing arbitrary numbers at words to classify them, we ought to be thinking about things like memory management and allocation for speed.

I would suggest checking out NLTK, the Natural Language Toolkit, written in Python, for a more extensive introduction to NLP and its practical uses.

+1

In fact, you only need about 600 words for a half-decent vocabulary.

0

Source: https://habr.com/ru/post/1277472/

