Tagger Training with Custom Tags in NLTK

I have a document with tagged data in the format: Hi here my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in [CITY New York]. I want to train a model on a set of documents tagged this way, and then use the model to label new documents. Is this possible with NLTK? I looked at chunking and the NLTK-Trainer scripts, but they support only a limited set of tags, while my dataset uses custom tags.

2 answers

As @AleksandarSavkov already wrote, this is essentially a named-entity recognition (NER) task, or more generally a chunking task, as you already understood. How to do it is described well in chapter 7 of the NLTK book. I recommend you skip the sections on regular-expression-based chunking and use the approach in section 3, "Developing and Evaluating Chunkers". It includes code samples you can use verbatim to create a chunker (ConsecutiveNPChunkTagger). It is then up to you to select features that give you good performance.
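The feature-selection step the answer mentions can be sketched as a plain function in the style of the NLTK book's npchunk_features; the function name and the particular features below are illustrative, not the book's exact code:

```python
# Hypothetical feature extractor in the style of the NLTK book's
# npchunk_features (ch. 7, sec. 3). The feature set here is illustrative;
# good performance depends on the features you choose for your own data.
def chunk_features(sentence, i, history):
    """Features for token i of a POS-tagged sentence.

    sentence -- list of (word, pos) pairs
    history  -- IOB tags already predicted for tokens 0..i-1
    """
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos, prevtag = "<START>", "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i - 1]
        prevtag = history[i - 1] if history else "<START>"
    return {
        "word": word.lower(),
        "pos": pos,
        "prevword": prevword,
        "prevpos": prevpos,
        "prevtag": prevtag,
    }

feats = chunk_features([("phone", "NN"), ("number", "NN")], 1, ["B-KEYWORD"])
```

A classifier-based chunker calls a function like this for every token, so adding or removing one feature here changes the whole model's behavior.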

You need to convert your data to the IOB format expected by NLTK's architecture; it expects part-of-speech tags, so the first step should be to run your input through a POS tagger; nltk.pos_tag() does a reasonably decent job (once you strip the markup, e.g. [KEYWORD ...]) and requires no additional software. When your corpus is in the following format (word -- POS tag -- IOB tag), you are ready to train a recognizer:

 Hi      NNP   O
 here    RB    O
 's      POS   O
 my      PRP$  O
 phone   NN    B-KEYWORD
 number  NN    I-KEYWORD
 ,       ,     O
 let     VB    O
 me      PRP   O
 ...
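The markup-stripping step can be sketched in pure Python; markup_to_iob is a hypothetical helper (the POS column would then be added with nltk.pos_tag on the word sequence):

```python
import re

# Sketch: convert the question's bracket markup into (word, IOB-tag) pairs.
# Tokenization here is deliberately crude (split on whitespace/brackets);
# a real pipeline would use a proper tokenizer before POS tagging.
def markup_to_iob(text):
    tokens = []
    # Match either a "[TAG some words]" span or a plain run of characters.
    pattern = re.compile(r"\[(\w+) ([^\]]+)\]|([^\s\[\]]+)")
    for m in pattern.finditer(text):
        if m.group(1):  # bracketed entity: first word B-, the rest I-
            tag, span = m.group(1), m.group(2).split()
            tokens.append((span[0], "B-" + tag))
            tokens.extend((w, "I-" + tag) for w in span[1:])
        else:           # plain token, outside any entity
            tokens.append((m.group(3), "O"))
    return tokens

pairs = markup_to_iob("my [KEYWORD phone number], call [PHONE 7802708523]")
```

Running nltk.pos_tag over the words from these pairs and zipping the results together yields the three-column training format shown above.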

The problem you are trying to solve is usually called Named Entity Recognition (NER). There are many algorithms that can help you solve it, but the most important thing is to convert your text data into a format suitable for sequence tagging. Here is an example of the BIO format:

 I      O
 love   O
 Paris  B-LOC
 and    O
 New    B-LOC
 York   I-LOC
 .      O
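Going back from BIO tags to entities is a mechanical decode step; a minimal pure-Python sketch (bio_to_entities is a hypothetical helper, not part of any library):

```python
# Sketch: recover (label, phrase) entity spans from BIO-tagged tokens.
def bio_to_entities(tagged):
    """tagged -- list of (word, bio_tag) pairs."""
    entities, current = [], None
    for word, tag in tagged:
        if tag.startswith("B-"):          # a new entity starts here
            if current:
                entities.append(current)
            current = (tag[2:], [word])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)       # continue the current entity
        else:                             # "O", or a stray I- tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(label, " ".join(words)) for label, words in entities]

ents = bio_to_entities([("I", "O"), ("love", "O"), ("Paris", "B-LOC"),
                        ("and", "O"), ("New", "B-LOC"), ("York", "I-LOC"),
                        (".", "O")])
```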

From there you can choose any type of classifier, such as Naive Bayes, SVM, MaxEnt, CRF, etc. Currently, the most popular algorithm for such token-level sequence labeling problems is the CRF. There are tools available that will let you train a BIO model from a file in the format shown above (for example, YamCha, CRF++, CRFSuite, Wapiti), although some of them were originally intended for chunking. If you use Python, you can look at scikit-learn, python-crfsuite, and PyStruct in addition to NLTK.
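CRF toolkits like those listed above consume per-token feature sets rather than raw words. A hedged sketch of such a feature function (token_features and the specific features are illustrative; toolkits like python-crfsuite accept dicts of this general shape per token):

```python
# Sketch of per-token features of the kind fed to CRF sequence labelers.
# The exact feature set is an illustrative assumption, not a prescription.
def token_features(sent, i):
    """sent -- list of words; returns a feature dict for position i."""
    word = sent[i]
    feats = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.isdigit": word.isdigit(),   # useful for PHONE-like entities
        "word.istitle": word.istitle(),   # useful for CITY-like entities
        "suffix3": word[-3:],
    }
    if i > 0:
        feats["prev.lower"] = sent[i - 1].lower()
    else:
        feats["BOS"] = True               # beginning of sentence
    if i < len(sent) - 1:
        feats["next.lower"] = sent[i + 1].lower()
    else:
        feats["EOS"] = True               # end of sentence
    return feats

X = [token_features(["I", "love", "Paris"], i) for i in range(3)]
```

Each training sequence is then the list of these dicts paired with the corresponding list of BIO tags.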


Source: https://habr.com/ru/post/1235975/
