How can I access the raw documents of the Brown Corpus?

For all other NLTK corpora, calling corpus.raw() returns the raw text from the files. For instance:

>>> from nltk.corpus import webtext
>>> webtext.raw()[:10]
'Cookie Man'

However, when you call brown.raw(), you get tagged text.

>>> from nltk.corpus import brown
>>> brown.raw()[:10]
'\n\n\tThe/at '

I have read all the documentation I could find, but I cannot find an obvious explanation or a way to get an untagged version. Is there a reason this corpus is tagged and the others are not?

2 answers

TL;DR

import nltk
nltk.download('brown')
nltk.download('nonbreaking_prefixes')
nltk.download('perluniprops')

from nltk.corpus import brown
from nltk.tokenize.moses import MosesDetokenizer

mdetok = MosesDetokenizer()

brown_natural = [mdetok.detokenize(' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'").split(), return_str=True)  for sent in brown.sents()]

for sent in brown_natural:
    print(sent)

In long

It's because the "raw" form of the Brown corpus is already tokenized and tagged, i.e. the tagged text is its original raw form =)

You can see individual files in the directory nltk_data:

$ head -n10 nltk_data/corpora/brown/ca01


    The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.


    The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.


    The/at September-October/np term/nn jury/nn had/hvd been/ben charged/vbn by/in Fulton/np-tl Superior/jj-tl Court/nn-tl Judge/nn-tl Durwood/np Pye/np to/to investigate/vb reports/nns of/in possible/jj ``/`` irregularities/nns ''/'' in/in the/at hard-fought/jj primary/nn which/wdt was/bedz won/vbn by/in Mayor-nominate/nn-tl Ivan/np Allen/np Jr./np ./.
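Given that word/tag layout, the tags can also be stripped by hand from the output of brown.raw(). The snippet below is only a minimal sketch: strip_brown_tags is a hypothetical helper name (not part of NLTK), and it assumes every token has the form word/tag with the tag after the last slash.

```python
def strip_brown_tags(raw_text):
    """Drop the /tag suffix from every whitespace-separated token."""
    lines = []
    for line in raw_text.splitlines():
        # Split on the LAST slash only, so './.' becomes '.' and
        # a token such as "Atlanta's/np$" becomes "Atlanta's".
        words = [token.rsplit("/", 1)[0] for token in line.split()]
        if words:  # skip the blank lines between sentences
            lines.append(" ".join(words))
    return "\n".join(lines)


sample = "\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr ./.\n"
print(strip_brown_tags(sample))
# prints: The Fulton County Grand Jury said Friday .
```

With the corpus installed, this could be applied as strip_brown_tags(brown.raw('ca01')), though the corpus reader methods shown next are the more usual route.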

To get the words without the tags, use brown.words(), e.g.

>>> from nltk.corpus import brown

>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]

>>> ' '.join(brown.words()[:30])
u"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in"

To access a specific file, pass its fileid:

>>> brown.fileids()[:10] # The first 10 fileids from brown.
[u'ca01', u'ca02', u'ca03', u'ca04', u'ca05', u'ca06', u'ca07', u'ca08', u'ca09', u'ca10']

>>> ' '.join(brown.words('ca01')[:30]) # First 30 words from the 'ca01' file.
u"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in"

To get the sentences, each as a list of tokens:

>>> brown.sents('ca01')
[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...]

To print each sentence as a space-joined string:

>>> for sent in brown.sents('ca01')[:5]: # First 5 sentences.
...     print(' '.join(sent))
... 
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .
`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .
The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .

If you want more natural-looking text, you can use the MosesDetokenizer. First, download the resources it needs:

>>> import nltk
>>> nltk.download('perluniprops')
[nltk_data] Downloading package perluniprops to
[nltk_data]     /home/ltan/nltk_data...
[nltk_data]   Unzipping misc/perluniprops.zip.
True
>>> nltk.download('nonbreaking_prefixes')
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /home/ltan/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!
True

Then initialize the MosesDetokenizer:

>>> from nltk.tokenize.moses import MosesDetokenizer
>>> mdetok = MosesDetokenizer()

Then call MosesDetokenizer.detokenize() on each sentence:

>>> for sent in brown.sents('ca01')[:5]: # First 5 sentences.
...     # Join the words in sentences and convert the `` -> "
...     # also convert '' -> " and ` -> '
...     munged_sentence = ' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'")
...     print(mdetok.detokenize(munged_sentence.split(), return_str=True)) # MosesDetokenizer expects a list of strings as input.
... 
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.
The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, "deserves the praise and thanks of the City of Atlanta" for the manner in which the election was conducted.
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible "irregularities" in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr..
"Only a relative handful of such reports was received", the jury said, "considering the widespread interest in the election, the number of voters and the size of this city".
The jury said it did find that many of Georgia's registration and election laws "are outmoded or inadequate and often ambiguous".
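In case nltk.tokenize.moses is not importable in your NLTK version (as far as I know it was dropped from later releases), a rough stand-in can be written with plain regular expressions. This is a heuristic sketch, not the Moses algorithm: simple_detokenize is a hypothetical name, and it only handles the Brown-style quote marks and common closing punctuation.

```python
import re


def simple_detokenize(tokens):
    """Join Brown-style tokens into a readable sentence (rough heuristic)."""
    text = " ".join(tokens)
    # Brown writes opening quotes as `` and closing quotes as ''.
    text = re.sub(r"``\s*", '"', text)   # opening quote hugs the next word
    text = re.sub(r"\s*''", '"', text)   # closing quote hugs the previous word
    text = text.replace("`", "'")        # any leftover single backtick
    # Remove the space in front of closing punctuation.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text


tokens = "`` Only a handful '' , the jury said , `` considering the city '' .".split()
print(simple_detokenize(tokens))
# prints: "Only a handful", the jury said, "considering the city".
```

It could be applied to each item of brown.sents() in the same way as mdetok.detokenize() above; cases such as brackets or contractions split into separate tokens are not covered.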

To detokenize the whole brown corpus:

from nltk.tokenize.moses import MosesDetokenizer
mdetok = MosesDetokenizer()
brown_natural = [mdetok.detokenize(' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'").split(), return_str=True)  for sent in brown.sents()]

[out]:

>>> for sent in brown_natural:
...     print(sent)
...     break
... 
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.

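Well, that is simply how the Brown corpus is stored. raw() returns the contents of the corpus files as they are, without interpreting them; the Brown files consist of "word/tag" tokens, so that is what you get. The same happens with other annotated corpora: nltk.corpus.treebank.raw('wsj_0001.mrg') and nltk.corpus.conll2000.raw("train.txt") also return annotated text, the latter in IOB format.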

If you just want the text without the tags, you can join the tokens of each sentence:

for sent in brown.sents():
    print(" ".join(sent))

This prints lines such as:

`` Only a relative handful of such reports was received '' , the jury said , `` considering
the widespread interest in the election , the number of voters and the size of this 
city '' .

As you can see, the quotes and punctuation stay tokenized. The answer by alvas shows how to detokenize them.


Source: https://habr.com/ru/post/1689296/

