I am trying to analyze the contents of a Drupal database for collective intelligence purposes.
So far, I have managed to develop a simple example that tokenizes various content (mainly forum posts) and counts tokens after removing stop words.
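The counting step can be sketched without Lucene as a lowercase-split-filter-count pipeline. This is a minimal illustration, not what Lucene's StandardAnalyzer actually does (a naive split on non-letters would, for instance, break a host name like sfkm.griffon.local into pieces, which StandardTokenizer avoids); the stop-word list here is a tiny hypothetical Italian sample:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class TokenCounter {
    // Hypothetical, deliberately incomplete Italian stop-word list for illustration.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("la", "di", "per", "i", "e", "lo", "a", "questo", "sta", "sulla"));

    /** Lowercases, splits on runs of non-letter/non-digit characters,
     *  drops stop words, and counts the remaining tokens. */
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

In a real Lucene analysis chain the same effect comes from StandardTokenizer followed by LowerCaseFilter and StopFilter.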
The StandardTokenizer that comes with Lucene should be able to tokenize host names and email addresses, but the content may also contain embedded HTML, for example:
Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi
Linux, UNIX e Windows.\r\n\r\nQuesto documento sta sulla piattaforma KM e lo potete
scaricare a questo <a href=\'https://sfkm.griffon.local/sites/BSF%20KM/BSF/CC%20T/Specifiche/Eventi2008/IBM%20DB2%20for%20Linux,%20UNIX%20e%20Windows.pdf\' target=blank>link</a>.
which gets tokenized like this:
pubblichiamo -> 1
presentazione -> 1
ibm -> 1
riguardante -> 1
db2 -> 1
vari -> 1
sistemi -> 1
operativi -> 1
linux -> 1
unix -> 1
windows -> 1
documento -> 1
piattaforma -> 1
km -> 1
potete -> 1
scaricare -> 1
href -> 1
https -> 1
sfkm.griffon.local -> 1
sites -> 1
bsf -> 1
20km/bsf -> 1
cc -> 1
20t/specifiche/eventi2008/ibm -> 1
20db2 -> 1
20for -> 1
20linux -> 1
20unix -> 1
20e -> 1
20windows.pdf -> 1
target -> 1
blank -> 1
link -> 1
What I would like is to keep the links but remove the HTML tags (e.g., <pre> or <strong>), which are useless.
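One way to do this is to pre-process the content before tokenization: pull the href URLs out first, then strip the remaining tags. Below is a minimal regex-based sketch; regexes are fragile against malformed HTML, so for production use Lucene's HTMLStripCharFilter (in lucene-analyzers-common) is the more robust choice for the tag-removal half:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlPreprocessor {
    // Matches href values quoted with ", ', or unquoted.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']?([^\"'>\\s]+)", Pattern.CASE_INSENSITIVE);
    // Matches any complete tag, opening or closing.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Collects href URLs so they survive the tag stripping. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Replaces tags with spaces so adjacent words don't fuse together. */
    public static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ");
    }
}
```

The extracted links can then be indexed or counted as whole tokens, while the stripped text goes through the normal analyzer chain.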
What is the best way to achieve this? Should I modify StandardTokenizerImpl, or can I compose the filters that already ship with Lucene (or pre-process the content myself before indexing)? Any suggestion would be appreciated.
EDIT:
It turns out that StandardTokenizerImpl is generated code, so if I have to extend it I will need to learn the lex/flex syntax it is generated from.