I am trying to learn how to use the NLTK package in Python. In particular, I need to use the Penn Treebank dataset in NLTK. As far as I know, if I call nltk.download('treebank'), I get 5% of the dataset. However, I have the complete dataset in a tar.gz file and I want to use it. It says here that:
If you have access to a full installation of the Penn Treebank, NLTK can be configured to load it as well. Download the ptb package, and in the directory nltk_data/corpora/ptb place the BROWN and WSJ directories of the Treebank installation (symlinks work as well). Then use the ptb module instead of treebank:
So I opened Python from the terminal, imported nltk, and typed nltk.download('ptb'). This command created the "ptb" directory under my ~/nltk_data directory, so I now have ~/nltk_data/corpora/ptb. Inside it, as suggested in the link above, I placed my dataset folder. This is my final directory hierarchy:
    $: pwd
    ~/nltk_data/corpora/ptb/WSJ
    $: ls
    00  02  04  06  08  10  12  14  16  18  20  22  24
    01  03  05  07  09  11  13  15  17  19  21  23  merge.log
Each of the folders from 00 to 24 contains many .mrg files, such as wsj_0001.mrg, wsj_0002.mrg, etc.
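As a quick sanity check on the layout described above, the per-section file counts can be listed with the standard library alone. This is a minimal sketch: the function name `summarize_mrg_files` is hypothetical, the path is the one from the listing above, and the extension is matched case-insensitively so that both `.mrg` and `.MRG` files are counted.

```python
from collections import Counter
from pathlib import Path


def summarize_mrg_files(root):
    """Count .mrg files per section directory (extension matched case-insensitively)."""
    root = Path(root).expanduser()
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() == ".mrg":
            # The parent directory name is the section, e.g. "00" or "24".
            counts[path.parent.name] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Path taken from the listing above; adjust it if nltk_data lives elsewhere.
    for section, n in summarize_mrg_files("~/nltk_data/corpora/ptb/WSJ").items():
        print(section, n)
```

An empty result here would suggest the files are not where the listing says they are; a non-empty one shows the data is on disk and the problem lies in how it is being looked up.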
Now, back to my question. Again, according to here, I should be able to get the file IDs if I write the following:
    >>> from nltk.corpus import ptb
    >>> print(ptb.fileids())
Unfortunately, when I type print(ptb.fileids()), I get an empty list:
    >>> print(ptb.fileids())
    []
Is there anyone who could help me?
EDIT: Here are the contents of my ptb directory and part of the allcats.txt file:
    $: pwd
    ~/nltk_data/corpora/ptb
    $: ls
    allcats.txt  WSJ
    $: cat allcats.txt
    WSJ/00/WSJ_0001.MRG news
    WSJ/00/WSJ_0002.MRG news
    WSJ/00/WSJ_0003.MRG news
    WSJ/00/WSJ_0004.MRG news
    WSJ/00/WSJ_0005.MRG news
    and so on ..
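One thing that can be checked from these listings is whether the file IDs in allcats.txt (for example WSJ/00/WSJ_0001.MRG) correspond exactly to the names on disk (for example wsj_0001.mrg). The sketch below is a diagnostic under assumptions, not a confirmed explanation: it assumes each line of allcats.txt starts with a file ID followed by a category, and that the reader looks files up in a case-sensitive way. The function names `read_listed_ids` and `find_mismatches` are hypothetical and not part of NLTK.

```python
from pathlib import Path


def read_listed_ids(cat_file):
    """Return the file IDs (first whitespace-separated field) from a category file."""
    ids = []
    for line in Path(cat_file).read_text().splitlines():
        fields = line.split()
        if fields:
            ids.append(fields[0])
    return ids


def find_mismatches(root):
    """Split listed IDs into exact matches, case-only mismatches, and missing files."""
    root = Path(root).expanduser()
    # Names are taken from directory listings, so their on-disk case is preserved.
    on_disk = {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
    lowered = {name.lower(): name for name in on_disk}
    exact, case_only, missing = [], [], []
    for file_id in read_listed_ids(root / "allcats.txt"):
        if file_id in on_disk:
            exact.append(file_id)
        elif file_id.lower() in lowered:
            # Same path except for letter case: record both spellings.
            case_only.append((file_id, lowered[file_id.lower()]))
        else:
            missing.append(file_id)
    return exact, case_only, missing
```

Calling `find_mismatches("~/nltk_data/corpora/ptb")` returns three lists. If most entries end up in the case-only list, the names on disk and the names in allcats.txt differ only in letter case, which would be worth ruling out before looking elsewhere.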