What are WordNet lexicographer files? Understanding How WordNet Works

Question

I am trying to understand the WordNet file formats, and the main documents are WNDB and WNINPUT. As I understand it from WNDB, there are files called index.something and data.something, where something can be noun, verb, adj, or adv.

So, if I want to learn something about the word dog as a noun, I look in index.noun for the word dog, which gives me the line:

 dog n 7 5 @ ~ #m #p %p 7 1 02086723 10133978 10042764 09905672 07692347 03907626 02712903

According to the WNDB document, this line represents this data:

 lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset [synset_offset...]

Where lemma is the word, pos is the identifier that says it is a noun, synset_cnt tells us how many synsets this word belongs to, p_cnt tells us how many different pointer types these synsets have, [ptr_symbol] is the list of those pointer symbols, sense_cnt and tagsense_cnt I did not understand and would like to have explained, and synset_offset is one or more synset offsets to look up in the data.noun file.
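To make the field layout concrete, here is a minimal Python sketch that splits the index.noun line above into the fields named by WNDB. The function name `parse_index_line` and the returned dictionary keys are made up for illustration. The comments on sense_cnt and tagsense_cnt paraphrase the WNDB description and should be checked against the document itself.

```python
def parse_index_line(line):
    """Split one index.<pos> line into named fields (layout as given in WNDB)."""
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    # p_cnt pointer symbols follow directly after the p_cnt field.
    ptr_symbols = fields[4:4 + p_cnt]
    rest = fields[4 + p_cnt:]
    # WNDB describes sense_cnt as a redundant copy of synset_cnt, and
    # tagsense_cnt as the number of senses ranked by frequency in tagged texts.
    sense_cnt = int(rest[0])
    tagsense_cnt = int(rest[1])
    synset_offsets = rest[2:2 + synset_cnt]
    return {
        "lemma": lemma,
        "pos": pos,
        "synset_cnt": synset_cnt,
        "ptr_symbols": ptr_symbols,
        "sense_cnt": sense_cnt,
        "tagsense_cnt": tagsense_cnt,
        "synset_offsets": synset_offsets,
    }


entry = parse_index_line(
    "dog n 7 5 @ ~ #m #p %p 7 1 "
    "02086723 10133978 10042764 09905672 07692347 03907626 02712903"
)
print(entry["ptr_symbols"])
print(entry["synset_offsets"])
```

Each entry in `synset_offsets` is the key to look up in data.noun.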

OK, so I know that these pointers point to something, and here are their descriptions, as written in WNINPUT:

 @   Hypernym
 ~   Hyponym
 #m  Member holonym
 #p  Part holonym
 %p  Part meronym

I do not yet know how to find the hypernym for this noun, but let's continue.

The other important data are the synset_offsets, which are:

 02086723 10133978 10042764 09905672 07692347 03907626 02712903

Let's look at the first one, 02086723, in data.noun:

 02086723 05 n 03 dog 0 domestic_dog 0 Canis_familiaris 0 023 @ 02085998 n 0000 @ 01320032 n 0000 #m 02086515 n 0000 #m 08011383 n 0000 ~ 01325095 n 0000 ~ 02087384 n 0000 ~ 02087513 n 0000 ~ 02087924 n 0000 ~ 02088026 n 0000 ~ 02089774 n 0000 ~ 02106058 n 0000 ~ 02112993 n 0000 ~ 02113458 n 0000 ~ 02113610 n 0000 ~ 02113781 n 0000 ~ 02113929 n 0000 ~ 02114152 n 0000 ~ 02114278 n 0000 ~ 02115149 n 0000 ~ 02115478 n 0000 ~ 02115987 n 0000 ~ 02116630 n 0000 %p 02161498 n 0000 | a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds; "the dog barked all night"

As you can see, we found a line starting with 02086723. The contents of this line are described in WNDB as:

 synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss

synset_offset we already know,

lex_filenum says which of the lexicographer files our word is in (this is the part I do not understand),

ss_type is n, which tells us that it is a noun,

w_cnt: a two-digit hexadecimal integer indicating the number of words in the synset, which in this case is 03, meaning we have 3 words in this synset: dog 0 domestic_dog 0 Canis_familiaris 0, each of which is followed by a number called

lex_id: a one-digit hexadecimal integer that, when appended to the lemma, uniquely identifies a sense within a lexicographer file,

p_cnt: a three-digit decimal integer counting the number of pointers, which in our case is 023, so we have 23 pointers, wow.

After p_cnt come the pointers, each of which has the format:

 pointer_symbol synset_offset pos source/target

Where pointer_symbol is one of the symbols I showed above (@, ~, ...),

synset_offset: the byte offset of the target synset in the data file corresponding to pos,

source/target: this field distinguishes lexical from semantic pointers. It is a four-byte field containing two two-digit hexadecimal integers. The first two digits indicate the word number in the current (source) synset, and the last two digits indicate the word number in the target synset. A value of 0000 means that pointer_symbol represents a semantic relation between the current (source) synset and the target synset indicated by synset_offset.
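As a small illustration of the source/target field, the following Python sketch decodes a single pointer. The function name `parse_pointer` and its dictionary keys are hypothetical. The first sample is the `@` pointer from the dog line. The second, `+ 02688440 a 0101`, is taken from the canine line shown further down and is an example of a lexical (word-to-word) pointer.

```python
def parse_pointer(text):
    """Decode one 'pointer_symbol synset_offset pos source/target' group."""
    symbol, offset, pos, source_target = text.split()
    return {
        "symbol": symbol,
        "offset": offset,
        "pos": pos,
        # 0000 marks a semantic pointer (synset to synset).
        "semantic": source_target == "0000",
        # Each half is a two-digit hexadecimal word number (1-based; 0 when semantic).
        "source_word": int(source_target[:2], 16),
        "target_word": int(source_target[2:], 16),
    }


semantic_ptr = parse_pointer("@ 02085998 n 0000")
lexical_ptr = parse_pointer("+ 02688440 a 0101")
print(semantic_ptr)
print(lexical_ptr)
```

For the lexical pointer, `source_word` 1 and `target_word` 1 mean the relation holds between the first word of the source synset and the first word of the target synset, not between the synsets as a whole.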

So let's look at the first pointer:

 @ 02085998 n 0000

This is a pointer with the @ symbol, denoting a hypernym. It points to the synset at offset 02085998 of type n (noun), and its source/target is 0000.

When I search for that offset in data.noun, I get

 02085998 05 n 02 canine 0 canid 0 011 @ 02077948 n 0000 #m 02085690 n 0000 + 02688440 a 0101 ~ 02086324 n 0000 ~ 02086723 n 0000 ~ 02116752 n 0000 ~ 02117748 n 0000 ~ 02117987 n 0000 ~ 02119787 n 0000 ~ 02120985 n 0000 %p 02442560 n 0000 | any of various fissiped mammals with nonretractile claws and typically long muzzles

which is the hypernym of dog. So that is how you find the relationships between synsets. I suppose the pointer symbols in the index.noun line for dog are only there to tell which types of relationships I can find for the word "dog"? Isn't that redundant? These pointer symbols already appear in the data line of each synset_offset, as we saw. When we look up each synset_offset in data.noun we can see these pointer symbols, so why are they needed in the index.noun file?

Also, notice that I did not use the lexicographer files at all. I know that in data.noun, specifically in the lex_filenum field, I can find out which lexicographer file holds the entry for dog, but what is that file for? As you can see, I could find the hypernym and many other relationships just by looking at the index and data files. I did not use any of the so-called lexicographer files.
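The lookup walked through above (data line for dog, then its @ pointers, then the data line at the target offset) can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `parse_data_line`, `DOG`, `CANINE` and `synsets` are invented here, the glosses are shortened, verb frames are ignored, and a dictionary stands in for the real data.noun file (where the offset is a byte position one could seek to).

```python
def parse_data_line(line):
    """Parse a noun data line: header fields, words, pointers, gloss."""
    body, _, gloss = line.partition(" | ")
    f = body.split()
    w_cnt = int(f[3], 16)                            # two-digit hexadecimal
    words = [f[4 + 2 * k] for k in range(w_cnt)]     # skip the lex_id after each word
    i = 4 + 2 * w_cnt
    p_cnt = int(f[i])                                # three-digit decimal
    i += 1
    pointers = [tuple(f[i + 4 * k:i + 4 * k + 4]) for k in range(p_cnt)]
    return {"offset": f[0], "lex_filenum": f[1], "ss_type": f[2],
            "words": words, "pointers": pointers, "gloss": gloss}


# The two data.noun lines quoted above (glosses shortened for this sketch).
DOG = (
    "02086723 05 n 03 dog 0 domestic_dog 0 Canis_familiaris 0 023 "
    "@ 02085998 n 0000 @ 01320032 n 0000 #m 02086515 n 0000 #m 08011383 n 0000 "
    "~ 01325095 n 0000 ~ 02087384 n 0000 ~ 02087513 n 0000 ~ 02087924 n 0000 "
    "~ 02088026 n 0000 ~ 02089774 n 0000 ~ 02106058 n 0000 ~ 02112993 n 0000 "
    "~ 02113458 n 0000 ~ 02113610 n 0000 ~ 02113781 n 0000 ~ 02113929 n 0000 "
    "~ 02114152 n 0000 ~ 02114278 n 0000 ~ 02115149 n 0000 ~ 02115478 n 0000 "
    "~ 02115987 n 0000 ~ 02116630 n 0000 %p 02161498 n 0000 "
    "| a member of the genus Canis"
)
CANINE = (
    "02085998 05 n 02 canine 0 canid 0 011 "
    "@ 02077948 n 0000 #m 02085690 n 0000 + 02688440 a 0101 "
    "~ 02086324 n 0000 ~ 02086723 n 0000 ~ 02116752 n 0000 ~ 02117748 n 0000 "
    "~ 02117987 n 0000 ~ 02119787 n 0000 ~ 02120985 n 0000 %p 02442560 n 0000 "
    "| any of various fissiped mammals"
)

# Stand-in for data.noun: offset -> parsed synset.
synsets = {s["offset"]: s for s in map(parse_data_line, (DOG, CANINE))}

dog = synsets["02086723"]
hypernym_offsets = [off for sym, off, pos, st in dog["pointers"] if sym == "@"]
for off in hypernym_offsets:
    target = synsets.get(off)
    print(off, target["words"] if target else "(line not included in this sketch)")
```

Note that the dog synset carries two @ pointers, so a synset can have more than one hypernym. Only the first target is among the lines quoted here.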

+5

artificial-intelligence nlp wordnet ontology

Lucas zanella Feb 14 '17 at 2:44

2 answers

Avner levy · Answer 1 · 2017-07-09T04:13:46+0000

What is useful in this information is the relationships that exist between synsets and (sometimes) the type of information. Everyone uses WordNet! Some even link it to RDF. But ... when I used WordNet a few years ago, I wanted to build the hypernyms of words, their superclasses and subclasses, as well as several other types of relationships that were not in WN, and I had to abandon WordNet and its jargon. I needed a "less simplified" organization of the "real world". I came up with my own, using a mixture of Wiktionary, a lot of regular expressions, some YAGO, several other ontologies that let me build hierarchies and other relationships, and some ML. I also looked at the classifications of Roger Schank and Roget's Thesaurus, and at various attempts to identify and classify concepts (typologies), such as Wierzbicka's and others. If you want something serious, DIY.

alvas · Answer 2 · 2017-02-14T02:49:22+0000

Yes, the Wordnet documentation is pretty hard to read ...

You are looking for this page: https://wordnet.princeton.edu/wordnet/man/lexnames.5WN.html

During WordNet development, synsets are organized into forty-five lexicographer files based on syntactic category and logical groupings.

These groups act as a kind of parallel, flat clustering alongside the hypernym/hyponym hierarchy.

In short:

From the docs:

File Format [lexnames file in WordNet-3.0/dict/]

Each line in lexnames contains 3 tab-separated fields and is terminated with a newline character. The first field is the two-digit decimal integer file number. (The first file in the list is numbered 00.) The second field is the name of the lexicographer file that is represented by that number, and the third field is an integer that indicates the syntactic category of the synsets contained in the file. This is simply a shortcut for programs and scripts, since the syntactic category is also part of the lexicographer file's name.

In layman's terms (mine):

This is just the standard for how values should be assigned to the second column in the data files, e.g. data.noun, data.verb, etc.
Traditionally, the creators / maintainers of WordNet would name their files accordingly, but sometimes it is easier to put all the nouns together and use an index that indicates the category of each synset.
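A minimal Python sketch of reading one such line, assuming the three tab-separated fields described above. The sample line is reconstructed from that description rather than copied from the file, the function name `parse_lexnames_line` is made up, and the category codes (1 noun, 2 verb, 3 adjective, 4 adverb) are my recollection of the lexnames man page, so verify them there.

```python
# Assumed encoding of the third field; check against the lexnames documentation.
SYNTACTIC_CATEGORY = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb"}


def parse_lexnames_line(line):
    """Split a lexnames line into (file number, file name, syntactic category)."""
    file_number, name, category = line.rstrip("\n").split("\t")
    return int(file_number), name, SYNTACTIC_CATEGORY[int(category)]


# Reconstructed example line: number, lexicographer file name, category code.
sample = "05\tnoun.animal\t1\n"
parsed = parse_lexnames_line(sample)
print(parsed)
```

The redundancy mentioned in the docs is visible here: the category in the third field repeats the prefix of the file name.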

The guidelines for the categories are as follows:

 File Number  Name                Contents
 00           adj.all             all adjective clusters
 01           adj.pert            relational adjectives (pertainyms)
 02           adv.all             all adverbs
 03           noun.Tops           unique beginner for nouns
 04           noun.act            nouns denoting acts or actions
 05           noun.animal         nouns denoting animals
 06           noun.artifact       nouns denoting man-made objects
 07           noun.attribute      nouns denoting attributes of people and objects
 08           noun.body           nouns denoting body parts
 09           noun.cognition      nouns denoting cognitive processes and contents
 10           noun.communication  nouns denoting communicative processes and contents
 11           noun.event          nouns denoting natural events
 12           noun.feeling        nouns denoting feelings and emotions
 13           noun.food           nouns denoting foods and drinks
 14           noun.group          nouns denoting groupings of people or objects
 15           noun.location       nouns denoting spatial position
 16           noun.motive         nouns denoting goals
 17           noun.object         nouns denoting natural objects (not man-made)
 18           noun.person         nouns denoting people
 19           noun.phenomenon     nouns denoting natural phenomena
 20           noun.plant          nouns denoting plants
 21           noun.possession     nouns denoting possession and transfer of possession
 22           noun.process        nouns denoting natural processes
 23           noun.quantity       nouns denoting quantities and units of measure
 24           noun.relation       nouns denoting relations between people or things or ideas
 25           noun.shape          nouns denoting two and three dimensional shapes
 26           noun.state          nouns denoting stable states of affairs
 27           noun.substance      nouns denoting substances
 28           noun.time           nouns denoting time and temporal relations
 29           verb.body           verbs of grooming, dressing and bodily care
 30           verb.change         verbs of size, temperature change, intensifying, etc.
 31           verb.cognition      verbs of thinking, judging, analyzing, doubting
 32           verb.communication  verbs of telling, asking, ordering, singing
 33           verb.competition    verbs of fighting, athletic activities
 34           verb.consumption    verbs of eating and drinking
 35           verb.contact        verbs of touching, hitting, tying, digging
 36           verb.creation       verbs of sewing, baking, painting, performing
 37           verb.emotion        verbs of feeling
 38           verb.motion         verbs of walking, flying, swimming
 39           verb.perception     verbs of seeing, hearing, feeling
 40           verb.possession     verbs of buying, selling, owning
 41           verb.social         verbs of political and social activities and events
 42           verb.stative        verbs of being, having, spatial relations
 43           verb.weather        verbs of raining, snowing, thawing, thundering
 44           adj.ppl             participial adjectives

So, for example, in WordNet-3.0/dict/data.noun, we see the lines:

 00034213 03 n 01 phenomenon 0 008 @ 00029677 n 0000 ~ 11408559 n 0000 ~ 11408733 n 0000 ~ 11408914 n 0000 ~ 11410625 n 0000 ~ 11418138 n 0000 ~ 11418460 n 0000 ~ 11529295 n 0000 | any state or process known through the senses rather than by intuition or reasoning
 00034479 04 n 01 thing 0 001 @ 00037396 n 0000 | an action; "how could you do such a thing?"

Look at the second column: for phenomenon the value is 03, which points to noun.Tops.

For thing the value is 04, which refers to noun.act.

IMHO, depending on your use case, these assignments may not be suitable. They are mainly used when creating WordNet, and they show how ontological hierarchies can easily be flattened into simple flat clusters.
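The same lookup can be sketched in a few lines of Python. The `LEXNAMES` dictionary below is a hand-copied excerpt of the table above (not read from the lexnames file), the helper name `lexname_of` is made up, and only the leading fields of each data line are included because the second column is all that is needed here.

```python
# Excerpt of the file-number table above; a full version would hold all 45 rows.
LEXNAMES = {"03": "noun.Tops", "04": "noun.act", "05": "noun.animal"}


def lexname_of(data_line):
    """Return the lexicographer file name for a data.<pos> line."""
    lex_filenum = data_line.split()[1]   # second column of the data line
    return LEXNAMES.get(lex_filenum, "unknown")


# Leading fields of the data.noun lines quoted in this thread.
for line in ("00034213 03 n 01 phenomenon 0",
             "00034479 04 n 01 thing 0",
             "02086723 05 n 03 dog 0 domestic_dog 0 Canis_familiaris 0"):
    print(line.split()[4], "->", lexname_of(line))
```

With this excerpt, the dog synset from the question resolves to noun.animal, which is all the lex_filenum value 05 encodes.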
