Incrementally trained entity recognition classifier

I'm researching semantic-web / NLP, and I have a set of sparse records containing a mix of numeric and non-numeric data, representing objects tagged with various features extracted from simple English sentences.

e.g.

    uid|features
    87w39423|speaker=432, session=43242, sentence=34, obj_called=bob, favorite_color_is=blue
    4535k3l535|speaker=512, session=2384, sentence=7, obj_called=tree, isa=plant, located_on=wilson_street
    23432424|speaker=997, session=8945305, sentence=32, obj_called=salty, isa=cat, eats=mice
    09834502|speaker=876, session=43242, sentence=56, obj_called=the monkey, ate=the banana
    928374923|speaker=876, session=43242, sentence=57, obj_called=it, was=delicious
    294234234|speaker=876, session=43243, sentence=58, obj_called=the monkey, ate=the banana
    sd09f8098|speaker=876, session=43243, sentence=59, obj_called=it, was=hungry
    ...

A single object may appear more than once (with a different UID each time), and its different appearances may have overlapping features. A second data set records which of the above UIDs definitely refer to the same object.

e.g.

    uid|sameas
    87w39423|234k2j,234l24jlsd,dsdf9887s
    4535k3l535|09d8fgdg0d9,l2jk34kl,sd9f08sf
    23432424|io43po5,2l3jk42,sdf90s8df
    09834502|294234234,sd09f8098
    ...
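For concreteness, a minimal sketch of parsing these pipe-delimited records into Python data structures (the function names are illustrative, not from any library, and I assume the layout shown above):

```python
def parse_features(line):
    # "uid|k=v, k=v, ..."  ->  (uid, {k: v, ...})
    uid, _, rest = line.partition('|')
    features = {}
    for pair in rest.split(','):
        key, _, value = pair.strip().partition('=')
        features[key] = value
    return uid, features

def parse_sameas(line):
    # "uid|other1,other2,..."  ->  set of all uids known to co-refer
    uid, _, rest = line.partition('|')
    return {uid} | {u.strip() for u in rest.split(',')}
```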

Which algorithm could I use to incrementally train a classifier that takes a set of features and immediately recommends the N most similar UIDs, along with the likelihood that those UIDs really represent the same object? Ideally, I would also like a recommendation of which missing features to fill in, so I can then reclassify to get more specific matches.

I've looked at traditional approximate nearest neighbor algorithms, e.g. FLANN and ANN, and I don't think they would be appropriate, since they are not trainable (in a supervised sense) and they are usually not designed for sparse, non-numeric input.

As a very naive first attempt, I considered a naive Bayes classifier, converting each SameAs relationship into a set of training samples. That is, for each entity A with SameAs relations to entities B, I would iterate over each pair and train the classifier like:

    classifier = Classifier()
    for entity, sameas_entities in sameas_dataset:
        entity_features = get_features(entity)
        for other_entity in sameas_entities:
            other_entity_features = get_features(other_entity)
            classifier.train(cls=entity,
                             features=['left_' + f for f in entity_features]
                                    + ['right_' + f for f in other_entity_features])
            classifier.train(cls=other_entity,
                             features=['left_' + f for f in other_entity_features]
                                    + ['right_' + f for f in entity_features])

And then use it like:

    >>> print classifier.findSameAs(dict(speaker=997, session=8945305, sentence=32,
    ...     obj_called='salty', isa='cat', eats='mice'), n=7)
    [(1.0, '23432424'), (0.999, 'io43po5'), (1.0, '2l3jk42'), (1.0, 'sdf90s8df'),
     (0.76, 'jerwljk'), (0.34, 'rlekwj32424'), (0.08, '09843jlk')]
    >>> print classifier.findSameAs(dict(isa='cat', eats='mice'), n=7)
    [(0.09, '23432424'), (0.06, 'jerwljk'), (0.03, 'rlekwj32424'), (0.001, '09843jlk')]
    >>> print classifier.findMissingFeatures(dict(isa='cat', eats='mice'), n=4)
    ['obj_called', 'has_fur', 'has_claws', 'lives_at_zoo']

How viable is this approach? Initial batch training would be very slow, at least O(N^2), but support for incremental training would allow faster updates.
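To make the incremental-update claim concrete, here is a hedged sketch of what such a classifier could look like: a multinomial naive Bayes with one class per UID backed by simple count tables, so each train() call costs O(number of features) and never revisits old samples. The class and method names are hypothetical and only loosely mirror the pseudo-code above:

```python
from collections import defaultdict
import math

class NaiveBayesSameAs:
    """Incremental multinomial naive Bayes with one class per UID."""

    def __init__(self):
        self.class_counts = defaultdict(int)                    # samples seen per class
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.total = 0

    def train(self, cls, features):
        # Incremental: O(len(features)) per call, no retraining over old data.
        self.class_counts[cls] += 1
        self.total += 1
        for f in features:
            self.feature_counts[cls][f] += 1

    def score(self, cls, features):
        # log P(cls) + sum_f log P(f | cls), with add-one smoothing.
        n = self.class_counts[cls]
        logp = math.log(float(n) / self.total)
        denom = n + 2.0
        for f in features:
            logp += math.log((self.feature_counts[cls][f] + 1.0) / denom)
        return logp

    def find_same_as(self, features, n=5):
        # Rank all known classes by log-score, highest first.
        ranked = sorted(((self.score(c, features), c)
                         for c in self.class_counts), reverse=True)
        return ranked[:n]
```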

What are the best approaches?

2 answers

I think this is more of a clustering problem than a classification problem. Your objects are data points, and the sameas data is a mapping from objects to clusters. In this case, the clusters are the distinct "things" your objects refer to.

You might want to take a look at semi-supervised clustering. A quick Google search turned up the paper "Active Semi-Supervision for Pairwise Constrained Clustering", which gives pseudo-code for an algorithm that is incremental/active and uses supervision in the form of training data indicating which objects are or are not in the same cluster. You can derive this easily from your data, e.g. by assuming that uids 87w39423 and 4535k3l535 are definitely different things.
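Deriving those pairwise constraints from the sameas data is straightforward; a sketch, assuming each sameas group is complete and the groups are mutually disjoint:

```python
from itertools import combinations

def pairwise_constraints(groups):
    """groups: list of sets of uids known to co-refer.
    Returns (must_link, cannot_link) as sets of sorted uid pairs."""
    must_link, cannot_link = set(), set()
    # Every pair within a group must end up in the same cluster.
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            must_link.add((a, b))
    # Every pair drawn from two different groups must not.
    for g1, g2 in combinations(groups, 2):
        for a in g1:
            for b in g2:
                cannot_link.add(tuple(sorted((a, b))))
    return must_link, cannot_link
```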

However, to make this work you need to come up with a distance metric based on the feature data. You have many options here; for example, you could use a simple Hamming distance over the features, but the choice of metric is somewhat arbitrary. I don't know of good principled ways to choose one, but you may already have looked into this when you examined the nearest-neighbor algorithms.
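For example, a Hamming-style distance over the union of feature keys could serve as a starting point (one possible choice among many, as noted):

```python
def feature_distance(a, b):
    """Fraction of keys present in either record whose values disagree.
    a, b: dicts mapping feature name -> value. Result is in [0, 1]."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    # A key missing from one record counts as a disagreement.
    disagreements = sum(1 for k in keys if a.get(k) != b.get(k))
    return disagreements / float(len(keys))
```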

You can derive confidence estimates from the distances to cluster centers. If you want actual membership probabilities, you would want a probabilistic clustering model, such as a Gaussian mixture model. There is quite a lot of software for Gaussian mixture modeling, though I don't know of any that is semi-supervised or incremental.
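As a crude stand-in for true posteriors, one can normalize distances to cluster centers into confidence scores via a softmax over negative distances (a heuristic, not a real probability model like a GMM):

```python
import math

def membership_confidences(distances):
    """distances: dict mapping cluster_id -> distance to that cluster's center.
    Returns cluster_id -> score in (0, 1); scores sum to 1, with closer
    clusters getting higher scores."""
    weights = {c: math.exp(-d) for c, d in distances.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}
```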

There may be other suitable approaches if the question you actually want answered is something like "given an object, which other objects are likely to refer to the same thing?", but I don't think that is what you're after.


You can take a look at this method:

"Large Scale Online Learning of Image Similarity Through Ranking", Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio, Journal of Machine Learning Research (2010). [ PDF ] [ Project Page ]
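For orientation, the core of that method (OASIS) is a passive-aggressive update of a bilinear similarity S_W(a, b) = aᵀ W b on ranking triplets; the following is a simplified single-update sketch under my reading of the paper, not the authors' implementation:

```python
import numpy as np

def oasis_update(W, q, p_pos, p_neg, C=0.1):
    """One passive-aggressive step: nudge W so that S_W(q, p_pos)
    exceeds S_W(q, p_neg) by a margin of 1. C caps the step size."""
    loss = max(0.0, 1.0 - q @ W @ p_pos + q @ W @ p_neg)
    if loss > 0.0:
        V = np.outer(q, p_pos - p_neg)        # gradient direction
        norm_sq = float(np.sum(V * V))
        if norm_sq > 0.0:
            tau = min(C, loss / norm_sq)      # passive-aggressive step size
            W = W + tau * V
    return W
```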

Other thoughts:

What do you mean by entity? Is an entity whatever "obj_called" refers to? Are you using the contents of obj_called to match different objects, e.g. is "John" similar to "John Doe"? Are you using proximity between sentences to suggest similar objects? What is the larger goal (the task) behind this matching?


Source: https://habr.com/ru/post/1379669/
