Extract part of a spaCy document as a new document

I have a rather long text parsed by spaCy into a Doc instance:

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp(content)

Here doc becomes an instance of the Doc class.

Now, since the text is huge, I would like to process, experiment and visualize in a Jupyter notebook using only part of the document, for example the first 100 sentences.

How can I slice an existing document and create a new Doc instance from part of it?

+6
3 answers

A rather ugly way to achieve your goal is to build a list of sentences and construct a new document from a subset of them.

sentences = [sent.text.strip() for sent in doc.sents][:100]
minidoc = nlp(' '.join(sentences))


+2

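Another option is to slice the underlying text by character offset (e.g. the first 200 characters) and parse the slice again: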

char_end = 200
subdoc = nlp(doc.text[:char_end])
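A fixed character cutoff can fall in the middle of a sentence. As a minimal sketch, the cut can instead be placed at the end of the n-th sentence using Span.end_char. This assumes spaCy v3 and uses a blank English pipeline with the rule-based sentencizer as a stand-in for en_core_web_lg, so it runs without a model download. The helper name reparse_first_sentences is hypothetical.

```python
from itertools import islice

import spacy

# Assumption: spaCy v3. A blank pipeline with the rule-based sentencizer
# stands in for en_core_web_lg so the sketch needs no model download.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


def reparse_first_sentences(nlp, doc, n):
    """Re-parse the text of the first n sentences of doc (hypothetical helper)."""
    sents = list(islice(doc.sents, n))
    if not sents:
        raise ValueError("doc has no sentences")
    # end_char is the character offset where the last kept sentence ends.
    char_end = sents[-1].end_char
    return nlp(doc.text[:char_end])


content = "This is my sentence. And here another one. A third one follows."
doc = nlp(content)
subdoc = reparse_first_sentences(nlp, doc, 2)
print(subdoc.text)
```

Like the character slice above, this runs the pipeline a second time on the shortened text, so it costs another parse.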
+1

There is a nicer solution using as_doc() on a Span object (https://spacy.io/api/span#as_doc):

import spacy

nlp = spacy.load('en_core_web_lg')
content = "This is my sentence. And here another one."
doc = nlp(content)
for i, sent in enumerate(doc.sents):
    print(i, "a", sent, type(sent))
    doc_sent = sent.as_doc()
    print(i, "b", doc_sent, type(doc_sent))

This produces the output:

0 a This is my sentence. <class 'spacy.tokens.span.Span'>   
0 b This is my sentence.  <class 'spacy.tokens.doc.Doc'>   
1 a And here another one.  <class 'spacy.tokens.span.Span'>   
1 b And here another one.  <class 'spacy.tokens.doc.Doc'>

(The code snippet is written out in full for clarity; it can of course be shortened.)
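For the goal in the question, a single document holding the first 100 sentences, the same method can be called on a slice of the Doc instead of on each sentence. The sketch below assumes spaCy v3 and again uses a blank English pipeline with the sentencizer in place of en_core_web_lg, so it runs without a model download. The helper name first_sentences is hypothetical.

```python
from itertools import islice

import spacy

# Assumption: spaCy v3. The sentencizer provides sentence boundaries here;
# with en_core_web_lg the parser would provide them instead.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


def first_sentences(doc, n):
    """Return the first n sentences of doc as a new Doc (hypothetical helper)."""
    sents = list(islice(doc.sents, n))
    if not sents:
        raise ValueError("doc has no sentences")
    # Span.end is the token index just past the last kept sentence,
    # so doc[:end] is a Span covering sentences 0..n-1.
    return doc[: sents[-1].end].as_doc()


content = "This is my sentence. And here another one. A third one follows."
doc = nlp(content)
minidoc = first_sentences(doc, 2)
print(type(minidoc), len(minidoc))
```

Because as_doc() copies the existing tokens and annotations, the text does not have to be parsed a second time. For the question as asked, the call would be first_sentences(doc, 100).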

0

Source: https://habr.com/ru/post/1690085/

