I am trying to understand the WordNet file formats, and the main documents are WNDB and WNINPUT. As I understand it from WNDB, there are files called `index.something` and `data.something`, where this *something* can be `noun`, `verb`, `adj`, or `adv`.
So, if I want to learn something about the word *dog* as a noun, I would look in `index.noun` for the word *dog*, which gives me this line:
dog n 7 5 @ ~ #m #p %p 7 6 02086723 10133978 10042764 09905672 07692347 03907626 02712903
According to the WNDB document, this line represents this data:
lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset [synset_offset...]
Where `lemma` is the word, `pos` is the identifier that tells us it is a noun, `synset_cnt` tells us how many synsets this word appears in, `p_cnt` tells us how many different pointer types the word has, `[ptr_symbol...]` is the list of those pointer symbols, `sense_cnt` and `tagsense_cnt` I did not understand and would like explained, and `synset_offset` is one or more byte offsets into the `data.noun` file.
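To check my understanding, here is a minimal Python sketch of how I would parse such an index line, assuming the fields are simply whitespace-separated as WNDB describes (the variable names are mine, taken from the field list above):

```python
# Sketch: parse an index.noun line per the WNDB field layout above.
line = ("dog n 7 5 @ ~ #m #p %p 7 6 "
        "02086723 10133978 10042764 09905672 07692347 03907626 02712903")

fields = line.split()
lemma, pos = fields[0], fields[1]
synset_cnt = int(fields[2])            # number of synsets containing the lemma
p_cnt = int(fields[3])                 # number of pointer symbols that follow
ptr_symbols = fields[4:4 + p_cnt]      # e.g. ['@', '~', '#m', '#p', '%p']
sense_cnt = int(fields[4 + p_cnt])     # per WNDB, same as synset_cnt (kept for compatibility)
tagsense_cnt = int(fields[5 + p_cnt])  # per WNDB, senses tagged in the semantic concordances
synset_offsets = fields[6 + p_cnt:]    # synset_cnt byte offsets into data.noun

print(lemma, ptr_symbols, synset_offsets)
```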
Ok, so I know that these pointers point to something, and here are their descriptions, as written in WNINPUT:
@ Hypernym
~ Hyponym
I do not know yet how to find the Hypernym for this noun, but let's continue:
Other important data are the `synset_offset`s, which are:
02086723 10133978 10042764 09905672 07692347 03907626 02712903
Let's look at the first one, `02086723`, in `data.noun`:
02086723 05 n 03 dog 0 domestic_dog 0 Canis_familiaris 0 023 @ 02085998 n 0000 @ 01320032 n 0000 ...
As you can see, we found a line starting with `02086723`. The contents of this line are described in WNDB as:
synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss
synset_offset we already know
`lex_filenum` tells us which of the lexicographer files contains our word (this is the part I do not understand yet),
`ss_type` is `n`, which tells us that it is a noun,
`w_cnt`: a two-digit hexadecimal integer indicating the number of words in the synset, which in this case is `03`, meaning we have 3 words in this synset: `dog 0 domestic_dog 0 Canis_familiaris 0`, each of which is followed by a number called
`lex_id`: a unique hexadecimal integer that, when appended to the lemma, uniquely identifies that sense within the lexicographer file,
`p_cnt`: the number of pointers, which in our case is `023`, so we have 23 pointers, wow
After `p_cnt` come the pointers, each of which has the format:
pointer_symbol synset_offset pos source/target
Where `pointer_symbol` is one of the symbols I showed earlier (`@`, `~`, ...),
`synset_offset`: the byte offset of the target synset in the data file corresponding to `pos`,
`source/target`: this field distinguishes between lexical and semantic pointers. It is a four-byte field containing two two-digit hexadecimal numbers. The first two digits indicate the word number in the current (source) synset; the last two digits indicate the word number in the target synset. A value of `0000` means that `pointer_symbol` represents a semantic relation between the current (source) synset as a whole and the target synset indicated by `synset_offset`.
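Again, to check my understanding, here is a sketch of parsing a `data.noun` line under the same whitespace-separated assumption. I use the (abridged) dog line from above, so only the pointers actually quoted get parsed; a full record would have all 23, plus a gloss after the `|`:

```python
# Sketch: parse a data.noun record per the WNDB layout above.
# The sample line is the abridged one for "dog"; real records end with "| gloss".
line = ("02086723 05 n 03 dog 0 domestic_dog 0 Canis_familiaris 0 "
        "023 @ 02085998 n 0000 @ 01320032 n 0000")

record, _, gloss = line.partition(" | ")     # split off the gloss, if present
fields = record.split()
synset_offset, lex_filenum, ss_type = fields[0], fields[1], fields[2]
w_cnt = int(fields[3], 16)                   # two-digit hexadecimal word count
i = 4
words = []
for _ in range(w_cnt):                       # w_cnt pairs of (word, lex_id)
    words.append((fields[i], int(fields[i + 1], 16)))
    i += 2
p_cnt = int(fields[i])                       # three-digit decimal pointer count
i += 1
available = (len(fields) - i) // 4           # the quoted line is abridged
pointers = []
for _ in range(min(p_cnt, available)):       # each pointer is 4 fields
    pointers.append(tuple(fields[i:i + 4]))
    i += 4

print(words)     # [('dog', 0), ('domestic_dog', 0), ('Canis_familiaris', 0)]
print(pointers)  # [('@', '02085998', 'n', '0000'), ('@', '01320032', 'n', '0000')]
```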
So let's look at the first pointer:
@ 02085998 n 0000
This is a pointer with the `@` symbol denoting a Hypernym; it points to the synset at offset `02085998` of type `n` (noun), and its `source/target` field is `0000`.
When I search in data.noun, I get
02085998 05 n 02 canine 0 canid 0 011 @ 02077948 n 0000 ...
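If I understand WNDB correctly, `synset_offset` is literally the byte offset at which the record starts in the data file, so I can retrieve it with a direct seek rather than scanning the whole file. A minimal sketch, assuming the files live in a typical `dict` directory (the path is a guess for my installation):

```python
# Sketch: fetch a synset record by seeking straight to its byte offset.
# "dict/data.noun" is an assumed path; adjust to your WordNet installation.
def synset_line(offset, path="dict/data.noun"):
    with open(path, "rb") as f:
        f.seek(int(offset))       # synset_offset = byte offset of the record
        return f.readline().decode("utf-8").rstrip()

print(synset_line("02085998"))    # the @ pointer's target: the canine synset
```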
which is the Hypernym of *dog*. So that is how you find relationships between synsets. I suppose the pointer symbols in the index line for *dog* were just there to tell me what types of relationships I could find for the word "dog"? Isn't that redundant? These pointer symbols already appear with each `synset_offset`, as we saw: when we look up each `synset_offset` in `data.noun`, we can see the pointer symbols there, so why are they needed in the `index.noun` file?
Also, note that I did not use the lexicographer files at all. I know that the `lex_filenum` field in `data.noun` tells me which lexicographer file holds the data for *dog*, but what is that file actually for? As you can see, I could find a hypernym and many other relationships just by looking at the index and data files; I never used any of the so-called lexicographer files.
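To back up that last claim, here is a sketch that climbs the whole hypernym chain using only `data.noun`, by repeatedly following the first `@` pointer (again, the `dict/data.noun` path is a guess for a typical installation):

```python
# Sketch: walk the hypernym chain starting from dog's synset, using
# nothing but data.noun. The "dict/data.noun" path is an assumption.
def hypernym_chain(offset, path="dict/data.noun"):
    while True:
        with open(path, "rb") as f:
            f.seek(int(offset))                    # synset_offset is a byte offset
            line = f.readline().decode("utf-8")
        record = line.partition(" | ")[0].split()  # drop the gloss, split fields
        print(offset, record[4])                   # offset and first word of the synset
        if "@" not in record:                      # no hypernym pointer: a root synset
            return
        i = record.index("@")                      # first hypernym pointer
        offset = record[i + 1]                     # its target synset_offset

hypernym_chain("02086723")                         # dog -> canine -> ... up to the root
```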