As @AleksandarSavkov already wrote, this is essentially a named entity recognition (NER) task or, more generally, a chunking task, as you already realized. How to do this is covered well in chapter 7 of the NLTK book. I recommend that you ignore the sections on regular-expression chunking and use the approach in section 3, Developing and Evaluating Chunkers. It includes code samples that you can use verbatim to create a chunker (ConsecutiveNPChunkTagger). Your responsibility is to select features that give you good performance.
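To make the feature-selection step concrete, here is a minimal sketch of a feature extractor. It assumes the (sentence, i, history) calling convention used by the book's chunk tagger, where sentence is a list of (word, POS) pairs and history holds the IOB tags already predicted. The function name keyword_features and the particular features are illustrative guesses, not a recommendation from the book; you will need to experiment to find what works on your data.

```python
def keyword_features(sentence, i, history):
    """Build a feature dict for token i of a POS-tagged sentence.

    sentence: list of (word, pos) pairs
    history:  IOB tags already predicted for tokens 0..i-1
    """
    word, pos = sentence[i]
    # POS tag of the neighbouring tokens, with sentinels at the sentence edges.
    prev_pos = sentence[i - 1][1] if i > 0 else "<START>"
    next_pos = sentence[i + 1][1] if i + 1 < len(sentence) else "<END>"
    return {
        "word": word.lower(),
        "pos": pos,
        "prev_pos": prev_pos,
        "next_pos": next_pos,
        # The previous IOB decision helps keep multi-word chunks together.
        "prev_iob": history[-1] if history else "<START>",
        "is_digit": word.isdigit(),
    }
```

A function like this would take the place of the book's own feature extractor inside the tagger class; everything else in the book's sample can stay as it is.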
You need to convert your data to the IOB format expected by the NLTK architecture. It expects part-of-speech tags, so the first step should be to run your input through a POS tagger; nltk.pos_tag() will do a decent enough job (once you strip the markup, such as [KEYWORD ...]) and does not require installing additional software. When your corpus is in the following format (word, POS tag, IOB tag), you are ready to train the recognizer:
    Hi NNP O
    here RB O
    's POS O
    my PRP$ O
    phone NN B-KEYWORD
    number NN I-KEYWORD
    , , O
    let VB O
    me PRP O
    ...
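One way to get from annotated text to these triples is sketched below. It assumes the markup looks like "[KEYWORD phone number]" (a label followed by the chunk's words inside square brackets), which may not match your data exactly. The helper name to_iob and the simple regex tokenizer are hypothetical stand-ins; the POS tagger is passed in as a callable so that you can supply nltk.pos_tag or anything else with the same shape.

```python
import re

# Assumed markup: "[LABEL word word ...]". Adapt the pattern to your annotation format.
CHUNK_RE = re.compile(r"\[(\w+)\s+([^\]]+)\]")
# Naive tokenizer (stand-in for a real one): clitics like 's, words, single punctuation marks.
TOKEN_RE = re.compile(r"'\w+|\w+|[^\w\s]")


def to_iob(text, pos_tagger):
    """Return (word, pos, iob) triples for text annotated with [LABEL ...] spans.

    pos_tagger is any callable that maps a list of tokens to a list of
    (token, tag) pairs, for example nltk.pos_tag.
    """
    words, iob_tags = [], []
    pos = 0
    for match in CHUNK_RE.finditer(text):
        # Tokens before the annotated span are outside any chunk.
        for tok in TOKEN_RE.findall(text[pos:match.start()]):
            words.append(tok)
            iob_tags.append("O")
        label = match.group(1)
        # First token of the span gets B-, the rest get I-.
        for i, tok in enumerate(TOKEN_RE.findall(match.group(2))):
            words.append(tok)
            iob_tags.append(("B-" if i == 0 else "I-") + label)
        pos = match.end()
    # Remaining tokens after the last span.
    for tok in TOKEN_RE.findall(text[pos:]):
        words.append(tok)
        iob_tags.append("O")
    # Tag the markup-free token sequence, then attach the IOB tags.
    tagged = pos_tagger(words)
    return [(w, t, iob) for (w, t), iob in zip(tagged, iob_tags)]
```

Calling to_iob("Hi here's my [KEYWORD phone number], let me", nltk.pos_tag) should give triples in the format shown above, provided the NLTK tagger model is available. Because the tagger only ever sees the token list with the markup already removed, the bracket labels cannot confuse it.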