I'd like to learn Python and do some NLP, so I have finally gotten started. I downloaded the English Wikipedia mirror to use as a good data set and have already played around with it a little, at this stage just getting some of it into a SQLite db (I haven't worked with dbs in the past, unfortunately).
But I guess SQLite is not the way to go for a full-blown NLP project (/ experiment :) - what should I look at? HBase (.. and Hadoop) seem interesting; I suppose I could run that in Java, prototype in Python, and maybe move the really slow bits to Java ... Alternatively, I could just run MySQL .. but the data set is 12 GB, and I wonder whether that will be a problem? I also looked at Lucene, but I'm not sure how (other than by breaking the wiki articles into pieces) I would get that to work.
What comes to mind for a truly flexible NLP platform? (In fact, I don't know at this stage WHAT I want to do .. I just want to learn large-scale language analysis, tbh.)
Many thanks.