What is "dimension" in word embeddings?

I want to understand what is meant by “dimension” in word embeddings.

When I embed words as a matrix of vectors for NLP tasks, what role does dimension play? Is there a good example that can help me understand this concept?

+8
5 answers

Answer

A word embedding is simply a mapping of words to vectors. The dimension in word embeddings refers to the length of these vectors.

Additional Information

These mappings come in many formats. Most pre-trained embeddings are available as a space-separated text file, where each line contains the word in the first position and its vector representation next to it. If you were to split these lines, you would find that they have a length of 1 + dim, where dim is the dimensionality of the word vectors and 1 corresponds to the word being represented. See the GloVe pre-trained vectors for a real example.

For example, if you download glove.twitter.27B.zip, unzip it and run the following Python code:

    #!/usr/bin/python3

    with open('glove.twitter.27B.50d.txt') as f:
        lines = f.readlines()

    lines = [line.rstrip().split() for line in lines]

    print(len(lines))           # number of words (aka vocabulary size)
    print(len(lines[0]))        # length of a line
    print(lines[130][0])        # word 130
    print(lines[130][1:])       # vector representation of word 130
    print(len(lines[130][1:]))  # dimensionality of word 130

you will get the following output:

    1193514
    51
    people
    ['1.4653', '0.4827', ..., '-0.10117', '0.077996']  # shortened for illustration purposes
    50

Somewhat unrelated, but no less important: the lines in these files are sorted according to the frequency of the words in the corpus on which the embeddings were trained (most frequent words first).


You can also represent these embeddings as a dictionary, where the keys are words and the values are lists representing the word vectors. The length of these lists is the dimension of your word vectors.
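A minimal sketch of this dictionary representation, reusing the glove.twitter.27B.50d.txt file from the example above (the file name and the probe word are only illustrative):

    # Build a {word: vector} dictionary from a GloVe text file.
    embeddings = {}
    with open('glove.twitter.27B.50d.txt') as f:
        for line in f:
            parts = line.rstrip().split()
            embeddings[parts[0]] = [float(x) for x in parts[1:]]

    print(len(embeddings['people']))  # length of the list = dimension of the vectors (50 here)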

A more common practice is to represent them as a matrix (also called a lookup table) of dimensions (V x D), where V is the vocabulary size (that is, how many words you have) and D is the dimension of each word vector. In this case, you need to keep a separate dictionary that maps each word to its corresponding row in the matrix.
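A short sketch of that lookup-table representation, assuming the embeddings dictionary from the previous snippet and using numpy purely for illustration:

    import numpy as np

    # Assumes the `embeddings` dict built in the previous sketch.
    words = list(embeddings.keys())
    word_to_row = {word: i for i, word in enumerate(words)}  # word -> row index
    matrix = np.array([embeddings[w] for w in words])        # shape (V, D)

    print(matrix.shape)                         # (vocabulary size, dimension)
    print(matrix[word_to_row['people']].shape)  # (D,): one row per word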

Background

As for your question about the role that dimension plays, you will need some theoretical background. But in a few words, the space in which words are embedded has nice properties that make NLP systems work better. One of these properties is that words with similar meanings are spatially close to each other, that is, they have similar vector representations, as measured by a distance metric such as Euclidean distance or cosine similarity.
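As a minimal sketch of measuring that closeness with cosine similarity, again assuming the embeddings dictionary from above (the chosen words are only illustrative):

    import numpy as np

    def cosine_similarity(u, v):
        u, v = np.array(u), np.array(v)
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Words with related meanings should score noticeably higher than unrelated ones.
    print(cosine_similarity(embeddings['road'], embeddings['highway']))
    print(cosine_similarity(embeddings['road'], embeddings['banana']))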

You can visualize a 3D projection of several word embeddings here and see, for example, that the closest words to “roads” are “highway”, “road” and “routes” in the Word2Vec 10K embedding.

For a more detailed explanation, I recommend reading the “Word Embeddings” section of this post by Christopher Olah.

For more theory on why using word embeddings, which are an instance of distributed representations, is better than using, for example, one-hot encodings (local representations), I recommend reading the first sections of Distributed Representations by Geoffrey Hinton et al.

+9

Word embeddings, such as word2vec or GloVe, do not embed words into two-dimensional matrices; they use one-dimensional vectors. "Dimension" refers to the size of these vectors. It is separate from the vocabulary size, which is the number of words for which you actually keep vectors instead of just throwing them away.
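A toy sketch of that distinction, assuming gensim 4.x (the tiny corpus here is made up):

    from gensim.models import Word2Vec

    sentences = [["the", "dog", "barks"], ["the", "cat", "meows"], ["dogs", "and", "cats"]]
    model = Word2Vec(sentences, vector_size=50, min_count=1)  # dimension = 50

    print(len(model.wv.index_to_key))  # vocabulary size: how many words got a vector
    print(model.wv["dog"].shape)       # (50,): each word is a one-dimensional vector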

In theory, larger vectors can store more information, since they have more possible states. In practice, there is not much benefit beyond a size of 300-500, and in some applications even smaller vectors work fine.

Here is an image from the GloVe homepage:

[Image: word vectors rendered as columns of colored pixels]

The dimension of the vectors is shown on the left axis; reducing it would make the graphic shorter, for example. Each column is a separate vector, with the color of each pixel determined by the value at that position in the vector.

+4

I am not an expert, but I think the dimensions simply represent variables (aka attributes or features) that are assigned to words, although there may be more to it than that. The meaning of each dimension and the total number of dimensions will be specific to your model.

I recently saw this embedding visualization from the TensorFlow library: https://www.tensorflow.org/get_started/embedding_viz

It especially helps to reduce high-dimensional models to something perceivable by humans. If you have more than three variables, it is extremely difficult to visualize the clustering (unless, of course, you are Stephen Hawking).

This Wikipedia article on dimensionality reduction and its related pages discuss how features are represented in dimensions, and the problems of having too many.
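As a rough sketch of that kind of reduction for visualization, assuming scikit-learn and the (V x D) embedding matrix built in an earlier answer:

    from sklearn.decomposition import PCA

    # Assumes `matrix` of shape (V, D), e.g. from the lookup-table sketch above.
    reduced = PCA(n_components=3).fit_transform(matrix)  # project D dimensions down to 3
    print(reduced.shape)  # (V, 3): each word becomes a point you can plot in 3D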

+1

"Dimension" in vocabulary attachments represents the total number of functions that it encodes. In fact, this is a simplification of the definition, but will come to this point later.

The choice of features is usually not manual; it happens automatically via a hidden layer during the training process. Depending on the corpus, the most useful dimensions (features) are selected. For example, if the corpus consists of romance fiction, a dimension for gender is much more likely to be represented than in mathematics literature.

When you have an embedding vector of, say, 100 dimensions created by a neural network for 100,000 unique words, it is generally impractical to investigate the purpose of each dimension and try to label each one with a "feature name". Because the feature(s) that each dimension represents may not be simple and orthogonal, and because the process is automatic, nobody knows exactly what each dimension represents.

For a deeper understanding of this topic, you may find this post helpful.

+1

According to the book Neural Network Methods for Natural Language Processing by Goldberg, the dimensionality in word embeddings (d_emb) refers to the number of columns in the first weight matrix (the weights between the input layer and the hidden layer) of algorithms such as word2vec. In the skip-gram model diagram, N is the dimensionality of the word embedding.
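A hedged sketch of those weight matrix shapes in a skip-gram-style network, with numpy used only to show the dimensions (the sizes V and N are made up):

    import numpy as np

    V, N = 10000, 300  # V: vocabulary size, N: embedding dimensionality (made-up numbers)

    W_in = np.random.randn(V, N)   # input-to-hidden weights; row i is the embedding of word i
    W_out = np.random.randn(N, V)  # hidden-to-output weights

    word_index = 42                # hypothetical row of some word in the vocabulary
    print(W_in[word_index].shape)  # (300,): the word's embedding has length N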

For more information, you can refer to this link: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

0

Source: https://habr.com/ru/post/1270354/

