I'm researching semantic web / NLP, and I have a set of sparse records containing a mix of numeric and non-numeric data. Each record represents an object annotated with various functions extracted from simple English sentences.
e.g.:
```
uid|features
87w39423|speaker=432, session=43242, sentence=34, obj_called=bob, favorite_color_is=blue
4535k3l535|speaker=512, session=2384, sentence=7, obj_called=tree, isa=plant, located_on=wilson_street
23432424|speaker=997, session=8945305, sentence=32, obj_called=salty, isa=cat, eats=mice
09834502|speaker=876, session=43242, sentence=56, obj_called=the monkey, ate=the banana
928374923|speaker=876, session=43242, sentence=57, obj_called=it, was=delicious
294234234|speaker=876, session=43243, sentence=58, obj_called=the monkey, ate=the banana
sd09f8098|speaker=876, session=43243, sentence=59, obj_called=it, was=hungry
...
```
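For concreteness, each record line can be parsed into a UID and a feature dict with a few lines of Python (`parse_record` is just a helper I'm sketching for this question, not part of any library):

```python
def parse_record(line):
    """Split 'uid|k1=v1, k2=v2, ...' into (uid, {k: v})."""
    uid, _, feature_str = line.partition('|')
    features = {}
    for pair in feature_str.split(','):
        # values may contain spaces ('the monkey'), but never commas
        key, _, value = pair.strip().partition('=')
        features[key] = value
    return uid, features

uid, features = parse_record(
    '23432424|speaker=997, session=8945305, sentence=32, obj_called=salty, isa=cat, eats=mice')
# uid == '23432424'; features['isa'] == 'cat'
```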
A single object may appear more than once (with a different UID each time), and its appearances may have overlapping functions. A second data set records which of the above UIDs definitely refer to the same object.
e.g.:
```
uid|sameas
87w39423|234k2j,234l24jlsd,dsdf9887s
4535k3l535|09d8fgdg0d9,l2jk34kl,sd9f08sf
23432424|io43po5,2l3jk42,sdf90s8df
09834502|294234234,sd09f8098
...
```
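The SameAs lines parse the same way, and each one can be expanded into symmetric training pairs (again, these helper names are just a sketch for this question):

```python
def parse_sameas(line):
    """Split 'uid|uid1,uid2,...' into (uid, [uid1, uid2, ...])."""
    uid, _, rest = line.partition('|')
    return uid, rest.split(',')

def sameas_pairs(sameas_dataset):
    """Yield each SameAs relation in both directions, so training is symmetric."""
    for uid, others in sameas_dataset:
        for other in others:
            yield uid, other
            yield other, uid

uid, same = parse_sameas('09834502|294234234,sd09f8098')
# uid == '09834502'; same == ['294234234', 'sd09f8098']
```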
Which algorithm would I use to incrementally train a classifier that takes a set of functions and immediately recommends the N most similar UIDs, along with the likelihood that those UIDs really represent the same object? Ideally, I'd also like a recommendation of which missing functions to fill in, so that I can reclassify and get more specific matches.
I explored traditional approximate nearest-neighbor algorithms, e.g. FLANN and ANN, but I don't think they would be appropriate, since they aren't learnable (in a supervised sense of learning) and they usually aren't designed for sparse, non-numeric input.
As a very naive first attempt, I considered a naive Bayes classifier, converting each SameAs relationship into a set of training samples. That is, for each entity A with SameAs entities B, I would iterate over each pair and train the classifier in both directions, e.g.:
```python
classifier = Classifier()
for entity, sameas_entities in sameas_dataset:
    entity_features = get_features(entity)
    for other_entity in sameas_entities:
        other_entity_features = get_features(other_entity)
        classifier.train(cls=entity,
                         feats=['left_' + f for f in entity_features]
                             + ['right_' + f for f in other_entity_features])
        classifier.train(cls=other_entity,
                         feats=['left_' + f for f in other_entity_features]
                             + ['right_' + f for f in entity_features])
```
And then use it like:
```python
>>> print classifier.findSameAs(dict(speaker=997, session=8945305, sentence=32,
...                                  obj_called='salty', isa='cat', eats='mice'), n=7)
[(1.0, '23432424'), (1.0, '2l3jk42'), (1.0, 'sdf90s8df'), (0.999, 'io43po5'),
 (0.76, 'jerwljk'), (0.34, 'rlekwj32424'), (0.08, '09843jlk')]
>>> print classifier.findSameAs(dict(isa='cat', eats='mice'), n=7)
[(0.09, '23432424'), (0.06, 'jerwljk'), (0.03, 'rlekwj32424'), (0.001, '09843jlk')]
>>> print classifier.findMissingFeatures(dict(isa='cat', eats='mice'), n=4)
['obj_called', 'has_fur', 'has_claws', 'lives_at_zoo']
```
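To make the idea concrete, here is a minimal runnable sketch of the incremental naive Bayes I have in mind (the class and method names are mine, not any real library's API; for brevity I've dropped the left_/right_ prefixes and return raw log-scores rather than calibrated probabilities):

```python
import math
from collections import defaultdict

class NaiveBayesSameAs(object):
    """Incremental naive Bayes over sparse string features (a sketch)."""

    def __init__(self):
        self.class_counts = defaultdict(int)                         # cls -> #samples
        self.feature_counts = defaultdict(lambda: defaultdict(int))  # cls -> feature -> count
        self.total = 0

    def train(self, cls, feats):
        # incremental update: just bump counts, no batch refit needed
        self.class_counts[cls] += 1
        self.total += 1
        for f in feats:
            self.feature_counts[cls][f] += 1

    def _log_score(self, cls, feats):
        # log P(cls) + sum_f log P(f|cls), with add-one smoothing
        prior = math.log((self.class_counts[cls] + 1.0)
                         / (self.total + len(self.class_counts)))
        n = self.class_counts[cls]
        likelihood = sum(math.log((self.feature_counts[cls][f] + 1.0) / (n + 2.0))
                         for f in feats)
        return prior + likelihood

    def find_same_as(self, features, n=7):
        feats = ['%s=%s' % kv for kv in sorted(features.items())]
        ranked = sorted(((self._log_score(c, feats), c) for c in self.class_counts),
                        reverse=True)
        return ranked[:n]

clf = NaiveBayesSameAs()
clf.train('23432424', ['isa=cat', 'eats=mice', 'obj_called=salty'])
clf.train('4535k3l535', ['isa=plant', 'obj_called=tree'])
best = clf.find_same_as({'isa': 'cat', 'eats': 'mice'}, n=1)
# best[0][1] == '23432424'
```

Because scoring iterates over every known class, ranking is O(classes), which is part of what makes me doubt this scales.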
How viable is this approach? Initial batch training would be very slow, at least O(N^2) in the number of SameAs pairs, but support for incremental training would allow faster updates afterwards.
What are the best approaches?