Getting started with NLP - big Python dataset

I would like to learn Python and do some NLP, and I've finally got things working. I downloaded the English Wikipedia mirror to use as a dataset, and I've already played with it a little at this stage, loading some of the articles into a SQLite DB (I haven't worked with DBs in the past, unfortunately).

But I guess SQLite is not the way to go for a full-blown NLP project (/ experiment :) - what should I look at? HBase (and Hadoop) seem interesting; I suppose I could run them in Java, prototype in Python, and maybe port the really slow bits to Java. Alternatively, I could just run MySQL, but the dataset is 12 GB, so will that be a problem? I also looked at Lucene, but I'm not sure how to get it to work here (other than breaking the wiki articles into pieces).
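For reference, the SQLite stage described above needs nothing beyond Python's standard library. A minimal sketch (the `articles` table and the sample rows are made up for illustration):

```python
import sqlite3

# Open (or create) the database file; sqlite3 ships with Python.
conn = sqlite3.connect("wiki.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS articles (title TEXT PRIMARY KEY, body TEXT)"
)

# A couple of placeholder "articles" standing in for parsed wiki pages.
pages = [
    ("Natural language processing", "NLP is a field of computer science..."),
    ("Python (programming language)", "Python is a high-level language..."),
]
conn.executemany("INSERT OR REPLACE INTO articles VALUES (?, ?)", pages)
conn.commit()

# Look one article back up.
row = conn.execute(
    "SELECT body FROM articles WHERE title = ?",
    ("Natural language processing",),
).fetchone()
print(row[0])
```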

What comes to mind for a truly flexible NLP platform? In fact, I don't know at this stage WHAT I want to do; I just want to do large-scale analysis of language, tbh.

Many thanks.

+3
5 answers

NLTK is probably what you want; it's Python-based and well suited to this kind of experimentation (corpora, tokenizers, taggers, and so on). For the database, if SQLite's SQL turns out to be too limited, look at PostgreSQL.
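As a small illustration of what NLTK gives you out of the box (assuming `nltk` is installed; these particular helpers need no corpus downloads):

```python
from nltk import FreqDist, ngrams

# Toy "corpus" standing in for Wikipedia text.
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()  # a real pipeline would use nltk.word_tokenize

# Frequency distribution over tokens.
fdist = FreqDist(tokens)
print(fdist.most_common(3))

# Bigrams: the building block for simple collocation and language-model work.
bigrams = list(ngrams(tokens, 2))
print(bigrams[:3])
```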

+4

There was a talk at PyCon 2010 on exactly this topic, covering large-scale NLP with NLTK and Dumbo (a Python wrapper for Hadoop streaming); it's worth looking up.
Also, don't worry about SQLite and the 12 GB; it should handle it fine. If you do outgrow SQLite, you can always migrate the data to something heavier later, but for prototyping it's more than enough.
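If you do stay on SQLite, a 12 GB table is manageable as long as you stream rows instead of loading everything at once. A minimal sketch (the `articles` table is assumed; an in-memory DB stands in for the real file):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the 12 GB wiki.db file
conn.execute("CREATE TABLE articles (title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [(f"Article {i}", f"body text {i}") for i in range(10)],
)

# Iterating over the cursor fetches rows lazily, so memory use stays flat
# no matter how large the table is.
count = 0
for title, body in conn.execute("SELECT title, body FROM articles"):
    count += 1  # tokenize/process each article here
print(count)
```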

+1

It sounds like what you want is Vector Space Model analysis.

Which tool fits depends on what you end up doing, but for indexing and searching the articles there are a few options.

Apache Lucene: you would drive it from Java, or from Python via a bridge to Java Lucene. Elasticsearch is built on Apache Lucene and has Python clients; it exposes everything through a REST API, which makes it easy to use from any language.

PostgreSQL also has built-in full-text search that you could run over the article table.

Personally, I'd go with Lucene/Elasticsearch for this.
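To make the Vector Space Model idea concrete, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity over a toy corpus (a real project would let Lucene/Elasticsearch or a library do this):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
tokenized = [d.split() for d in docs]

# Inverse document frequency: rare terms get higher weight.
n_docs = len(tokenized)
df = Counter(term for doc in tokenized for term in set(doc))
idf = {term: math.log(n_docs / count) for term, count in df.items()}

def tfidf(doc):
    """Map a tokenized document to a sparse TF-IDF vector (a dict)."""
    tf = Counter(doc)
    return {term: freq * idf[term] for term, freq in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0

vectors = [tfidf(doc) for doc in tokenized]
# Documents 0 and 1 share "the" and "cat", so they should score as more
# similar to each other than either is to document 2.
print(cosine(vectors[0], vectors[1]))
```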

+1

A couple of suggestions:

spaCy is a natural language processing (NLP) library for Python, designed to be fast and usable in production. Gensim is a Python library for topic modelling that includes Word2Vec and other word-embedding models.

Stanford NLP also has a Python framework and supports 50+ languages.

Personally, I'd recommend spaCy. It plays well with gensim through its API, and it's built for real workloads rather than just teaching. If you want to see how the pieces fit together, look at the spaCy architecture documentation.

Hope this helps.
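As a small, hedged illustration (assuming `spacy` is installed; `spacy.blank` builds a tokenizer-only pipeline, so no trained model download is needed):

```python
import spacy

# A blank English pipeline: just the tokenizer, no trained components.
nlp = spacy.blank("en")
doc = nlp("Wikipedia is a free online encyclopedia.")

tokens = [token.text for token in doc]
print(tokens)
```

With a downloaded model such as `en_core_web_sm` loaded via `spacy.load`, the same `doc` object would also carry part-of-speech tags, entities, and dependency parses.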

0

Cortecx is a new NLP library that's pretty easy to use, so it shouldn't be a problem for beginners. I pretty much built it for this purpose and would like some feedback. It can do all kinds of things like POS tagging, chunking, and NER, and it even comes with a dictionary, a thesaurus, and built-in word embeddings, so check it out:

Here is the Cortecx website and documentation: https://www.cerybra.com/

0

Source: https://habr.com/ru/post/1739710/
