How can I use the full Penn Treebank dataset inside Python / NLTK?

Question


I am trying to learn to use the NLTK package in Python. In particular, I need to use the Penn Treebank dataset in NLTK. As far as I know, calling nltk.download('treebank') gives me only about 5% of the dataset. However, I have the complete dataset in a tar.gz file and I want to use that instead. According to the NLTK documentation:

If you have access to a full installation of the Penn Treebank, NLTK can be configured to load it as well. Download the ptb package, and in the directory nltk_data/corpora/ptb place the BROWN and WSJ directories of the Treebank installation (symbolic links also work). Then use the ptb module instead of treebank:

So I opened Python from the terminal, imported nltk, and ran nltk.download('ptb'). This created a "ptb" directory under ~/nltk_data, so I now have ~/nltk_data/corpora/ptb. Inside it, as the documentation quoted above suggests, I placed my dataset folder. This is my final directory hierarchy:

    $ pwd
    ~/nltk_data/corpora/ptb/WSJ
    $ ls
    00 02 04 06 08 10 12 14 16 18 20 22 24
    01 03 05 07 09 11 13 15 17 19 21 23 merge.log

Each folder from 00 to 24 contains many .mrg files, such as wsj_0001.mrg, wsj_0002.mrg, and so on.

Now back to my question. According to the same documentation, I should be able to get the file IDs by writing the following:

    >>> from nltk.corpus import ptb
    >>> print(ptb.fileids()) # doctest: +SKIP
    ['BROWN/CF/CF01.MRG', 'BROWN/CF/CF02.MRG', 'BROWN/CF/CF03.MRG', 'BROWN/CF/CF04.MRG', ...]

Unfortunately, when I type print(ptb.fileids()), I get an empty list:

    >>> print(ptb.fileids())
    []

Is there anyone who could help me?

EDIT: here are the contents of my ptb directory and part of the allcats.txt file:

    $ pwd
    ~/nltk_data/corpora/ptb
    $ ls
    allcats.txt WSJ
    $ cat allcats.txt
    WSJ/00/WSJ_0001.MRG news
    WSJ/00/WSJ_0002.MRG news
    WSJ/00/WSJ_0003.MRG news
    WSJ/00/WSJ_0004.MRG news
    WSJ/00/WSJ_0005.MRG news
    and so on ..


python nlp nltk corpus penn-treebank

zwlayer Mar 18 '16 at 8:21


1 answer

freieschaf · Answer 1 · 2016-04-29T10:21:39+0000

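The PTB corpus reader needs uppercase directory and file names (as the contents of allcats.txt, which you included in your question, suggest). This clashes with many Penn Treebank distributions, which use lowercase names. Your WSJ folder is uppercase, but the files inside it (wsj_0001.mrg and so on) are not, so the reader finds nothing.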
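Before renaming anything, it can help to see which files are affected. The following is a minimal sketch that uses only the standard library, not NLTK; the function name `lowercase_mrg_files` is made up for illustration, and it assumes, as described above, that the reader only picks up fully uppercase paths.

```python
import os

def lowercase_mrg_files(root):
    """Return relative paths of .mrg files whose path is not fully uppercase.

    Hypothetical helper: it only inspects names on disk and does not
    call into NLTK at all.
    """
    offenders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Look at treebank files only, whatever the case of the extension.
            if not name.lower().endswith(".mrg"):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            # Any lowercase letter in the relative path marks the file.
            if rel != rel.upper():
                offenders.append(rel)
    return sorted(offenders)
```

Calling `lowercase_mrg_files(os.path.expanduser("~/nltk_data/corpora/ptb"))` on the layout from the question would list entries such as WSJ/00/wsj_0001.mrg; an empty result means the names are already uppercase.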

A quick fix is to rename the wsj and brown folders and their contents to uppercase. A UNIX command that does this:

    find . -depth | \
    while read LONG
    do
        SHORT=$( basename "$LONG" | tr '[:lower:]' '[:upper:]' )
        DIR=$( dirname "$LONG" )
        if [ "${LONG}" != "${DIR}/${SHORT}" ]
        then
            mv "${LONG}" "${DIR}/${SHORT}"
        fi
    done

(Adapted from an answer to another question.) It recursively renames every directory and file below the current directory to uppercase.
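If you would rather stay in Python, the same recursive rename can be sketched with the standard library. This is an illustrative equivalent, not part of NLTK; the name `uppercase_tree` is hypothetical.

```python
import os

def uppercase_tree(root):
    """Rename every file and directory below root to uppercase.

    Hypothetical helper mirroring the shell loop above. Returns the list
    of (old_path, new_path) pairs that were renamed.
    """
    renamed = []
    # topdown=False visits the deepest directories first, so a directory is
    # renamed only after everything inside it has already been processed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            upper = name.upper()
            if upper != name:
                src = os.path.join(dirpath, name)
                dst = os.path.join(dirpath, upper)
                os.rename(src, dst)
                renamed.append((src, dst))
    return renamed
```

Note that root itself is left unchanged and everything below it is uppercased, so point the function at the WSJ or BROWN folder (or at a directory holding only those) rather than at a directory that also contains files such as allcats.txt.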
