How to use Freebase to label a very large unlabeled NLP dataset?

The terminology I use:

noun phrase - a short phrase that refers to a specific person, place, or idea. Examples of noun phrases: Barack Obama, Obama, Bottle of Water, Yellowstone National Park, Google Chrome Web Browser, etc.

category - a semantic concept that determines which noun phrases belong to it and which do not. Examples of categories: Politician, Housewares, Food, People, Sports Teams, etc. Thus, “Barack Obama” belongs to “Politician” and “People”, but not to “Food” or “Sports Teams”.

I have a very large unlabeled NLP dataset consisting of millions of noun phrases. I would like to use Freebase to label these noun phrases. I have a mapping from Freebase types to my own categories. What I need to do is download all the instances of every Freebase type in that mapping.

The problem I am facing is figuring out how to structure this kind of query. At a high level, the query should ask Freebase "what are all the examples of topic XX?" and Freebase should answer "here is a list of all the examples of topic XX." I would be very grateful if someone could give me the syntax for such a query. If it can be done in Python, that would be awesome :)

+4
2 answers

The basic query form (e.g. for a person) is:

[{ "type":"/people/person", "name":None, "/common/topic/alias":[], "limit":100 }]​ 

Documentation is available at http://wiki.freebase.com/wiki/MQL_Manual

Using freebase.mqlreaditer() from the Python library http://code.google.com/p/freebase-python/ is the easiest way to iterate through all of the results. In this case, the "limit" clause determines the chunk size used for the underlying queries, but you still get each result back individually at the API level.
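
For example, a minimal sketch of iterating over the person query above (assuming the freebase-python library is installed and importable as freebase):

import freebase

# The same basic query as above; "limit" only controls the chunk size
# fetched per round-trip, mqlreaditer() still yields one result at a time.
query = [{
    "type": "/people/person",
    "name": None,
    "/common/topic/alias": [],
    "limit": 100,
}]

for result in freebase.mqlreaditer(query):
    print result["name"], result["/common/topic/alias"]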

By the way, how do you plan to distinguish Jack Kennedy the president from the hurler, the football player, the books, etc.? http://www.freebase.com/search?limit=30&start=0&query=jack+kennedy You might want to consider pulling additional information from Freebase (birth and death dates, book authors, other types assigned, etc.) if you have enough context to use it for disambiguation.
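
For example, a sketch of a richer query along those lines (the extra property path and the "a:type" prefix trick for reading back all of a topic's types are from memory of the MQL schema, so treat them as assumptions to check against the MQL manual):

query = [{
    "type": "/people/person",               # constrain to people
    "a:type": [],                            # also return every type on the topic (prefixed-property syntax, assumed)
    "mid": None,
    "name": None,
    "/people/person/date_of_birth": None,    # may need to be marked optional if missing values filter topics out
    "limit": 100,
}]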

Some time has passed since this was written, so it may now be easier and/or more efficient to work from the bulk data dumps rather than the API: http://wiki.freebase.com/wiki/Data_dumps
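
If you do go the dump route, a streaming pass is usually enough to pull out topic/type pairs. A minimal sketch, assuming a gzipped N-Triples style dump with tab-separated subject/predicate/object columns (the file name and URI prefixes below are assumptions to adapt to whichever dump you download):

import gzip

NS = '<http://rdf.freebase.com/ns/'
TYPE_PRED = NS + 'type.object.type>'
WANTED = NS + 'people.person>'        # the type you are extracting

for line in gzip.open('freebase-rdf-latest.gz'):
    parts = line.rstrip('\n').split('\t')
    if len(parts) < 3:
        continue
    subject, predicate, obj = parts[0], parts[1], parts[2]
    if predicate == TYPE_PRED and obj == WANTED:
        print subject                  # the topic's MID URI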

Edit - here is a working Python program, assuming that you have a list of type identifiers in a file called "types.txt":

import freebase

# One Freebase type ID per line in types.txt, e.g. "/people/person".
f = file('types.txt')
for t in f:
    t = t.strip()
    # Ask for every topic of this type: its MID, name, and aliases.
    q = [{'type': t,
          'mid': None,
          'name': None,
          '/common/topic/alias': [],
          'limit': 500,
          }]
    # mqlreaditer() pages through the complete result set one topic at a time.
    for r in freebase.mqlreaditer(q):
        # r['name'] can be null for unnamed topics, so guard the join.
        print '\t'.join([t, r['mid'], r['name'] or ''] + r['/common/topic/alias'])
f.close()

If you make the query more complex, you will probably want to lower the limit to avoid running into timeouts, but for a simple query like this, raising the limit above the default of 100 makes it more efficient by fetching results in larger chunks.

+4

The general problem described here is known as Entity Linking in natural language processing.

Shameless self-promotion:

See our book chapter on this topic for an introduction and an approach to large-scale entity linking.

http://cs.jhu.edu/~delip/entity_linking.pdf

@deliprao

+1

Source: https://habr.com/ru/post/1380847/

