The right way to create a Tokenizer in Lucene

I am trying to analyze the contents of a Drupal database for collective intelligence purposes.

So far, I have managed to develop a simple example that tokenizes the various contents (mainly forum posts) and counts the tokens after removing stop words.
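For illustration, the counting step can be sketched in plain Java roughly as follows. This is a minimal sketch and not the actual Lucene-based code: the class name `TokenCounter`, the tiny stop-word list, and the simple "split on anything that is not a letter or digit" rule are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TokenCounter {
    // Hypothetical, very small Italian stop-word list, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "la", "di", "per", "i", "e", "lo", "a", "questo", "sta", "sulla"));

    // Lowercases the text, splits on runs of characters that are neither
    // letters nor digits, drops stop words, and counts what is left.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints one "token -> count" line per distinct token.
        count("Pubblichiamo la presentazione di IBM riguardante DB2")
                .forEach((token, n) -> System.out.println(token + " -> " + n));
    }
}
```

A splitter this naive breaks host names and URLs apart, which is why a real tokenizer is used in the first place.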

The StandardTokenizer that comes with Lucene should be able to tokenize host names and email addresses, but the content may also contain embedded HTML, for example:

Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi
Linux, UNIX e Windows.\r\n\r\nQuesto documento sta sulla piattaforma KM e lo potete
scaricare a questo <a href=\'https://sfkm.griffon.local/sites/BSF%20KM/BSF/CC%20T/Specifiche/Eventi2008/IBM%20DB2%20for%20Linux,%20UNIX%20e%20Windows.pdf\' target=blank>link</a>.

It gets tokenized badly, like this:

pubblichiamo -> 1
presentazione -> 1
ibm -> 1
riguardante -> 1
db2 -> 1
vari -> 1
sistemi -> 1
operativi -> 1
linux -> 1
unix -> 1
windows -> 1
documento -> 1
piattaforma -> 1
km -> 1
potete -> 1
scaricare -> 1
href -> 1
https -> 1
sfkm.griffon.local -> 1
sites -> 1
bsf -> 1
20km/bsf -> 1
cc -> 1
20t/specifiche/eventi2008/ibm -> 1
20db2 -> 1
20for -> 1
20linux -> 1
20unix -> 1
20e -> 1
20windows.pdf -> 1
target -> 1
blank -> 1
link -> 1

What I would like is to keep links together as single tokens and to strip the HTML tags (such as <pre> or <strong>) that are useless.

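What is the right way to do this? Should I write my own Tokenizer? One option would be to start from StandardTokenizerImpl and add custom behavior, but that means going fairly deep into the Lucene implementation.

EDIT: Having looked at StandardTokenizerImpl, modifying it does not seem any more convenient than writing my own lexer with lex or flex.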

+3

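The easiest way to do this is to preprocess the text before handing it to Lucene to tokenize. Use an HTML parser such as Jericho to convert the content into text without HTML, dropping the tags you do not care about and extracting the text from those you do. Jericho's TextExtractor is well suited to this.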

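import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.TextExtractor;

// Note: the three pieces are concatenated without separating spaces.
String text = "Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi"
    +"Linux, UNIX e Windows.\r\n\r\nQuesto documento sta sulla piattaforma KM e lo potete"
    +"scaricare a questo <a href=\'https://sfkm.griffon.local/sites/BSF%20KM/BSF/CC%20T/Specifiche/Eventi2008/IBM%20DB2%20for%20Linux,%20UNIX%20e%20Windows.pdf\' target=blank>link</a>.";

// Keep the text inside <a> elements and exclude the content of every other element.
TextExtractor te = new TextExtractor(new Source(text)){
    @Override
    public boolean excludeElement(StartTag startTag) {
        // equals() is the safe way to compare the element names.
        return !HTMLElementName.A.equals(startTag.getName());
    }
};
System.out.println(te.toString());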

This prints:

Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativiLinux, UNIX e Windows. Questo documento sta sulla piattaforma KM e lo potetescaricare a questo link.

You could write a custom Lucene Tokenizer that handles HTML, but preprocessing the content with Jericho is simpler.

+2

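In general, when indexing documents that contain HTML with Lucene, first parse the HTML into a plain-text representation and only then pass that text to the tokenizer.

See the jGuru FAQ entry on indexing HTML documents for more on how to do this.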
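The two-stage idea (deal with the markup first, tokenize afterwards) can be sketched with the JDK alone. This is only a rough illustration under stated assumptions: the class and method names are hypothetical, and regular expressions are not a real HTML parser, so a library such as Jericho remains the more robust choice. Unquoted href values are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkAwarePreprocessor {
    // Matches href='...' or href="..." (optionally with backslash-escaped
    // quotes, as in the sample data); group 2 is the URL.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*\\\\?(['\"])(.*?)\\\\?\\1", Pattern.CASE_INSENSITIVE);

    // Matches anything that looks like a tag: <...>
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Stage 1a: collect each link URL whole, so it can later be indexed
    // as a single token instead of being split at '%' and '/'.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    // Stage 1b: replace every tag with a space and collapse whitespace,
    // leaving plain text that can be handed to the tokenizer (stage 2).
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }
}
```

The plain text returned by `stripTags` would then go to the tokenizer, while the URLs from `extractLinks` could be stored or indexed separately as whole values.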

+2

Source: https://habr.com/ru/post/1724398/

