Slovenian stemmer for Sphinx

I am looking for a stemming algorithm for the Slovenian language that I can use with Sphinx search.

What I'm trying to achieve: when searching for "jabolka", for example, I also need results for documents containing "jabolko", "jabolki", "jabolk", etc.

I found several links mentioning that a Slovenian stemmer exists, but I cannot find where to download it. It is not even sold anywhere.

Another option I came across is the wordforms option in the Sphinx configuration ( http://sphinxsearch.com/docs/manual-0.9.9.html#conf-wordforms ), but building my own dictionary would be too complicated, so I'm wondering whether any public dictionaries are already available.


If a Slovenian stemmer is not available, can anyone suggest a different approach to achieving similar search results?

2 answers

I managed to compile the Slovenian stemmer with the following steps:

  • Download http://snowball.tartarus.org/dist/snowball_code.tgz (the Snowball source code) and unpack it.
  • Download the Slovenian algorithm from http://snowball.tartarus.org/archives/snowball-discuss/0725.html and save it in the algorithms/slovene folder of the unpacked project from step 1. The file name should be stem_ISO_8859_2.sbl.
  • The algorithm is in ISO-8859-2 encoding, so I converted it to UTF-8 and saved it as stem_Unicode.sbl (you need to find the Unicode character codes for Slovenian special characters such as ČŠŽĆ).
  • Edit both .txt files in the libstemmer folder and add an entry for Slovenian:

     slovene UTF_8,ISO_8859_2 slovene,sl,slv 
  • Modify GNUmakefile in the project root and add slovene twice: once in the list of languages for UTF-8 and once in ISO_8859_2_algorithms.
  • Go to the libstemmer folder and run:

     ./mkmodules.pl modules.h src_c modules.txt ../mkinc.mak
     ./mkmodules.pl modules_utf8.h src_c modules_utf8.txt ../mkinc_utf8.mak

    This will create the files needed for compilation later.

  • Run make (from the root of the unpacked files).
  • If there were no errors during compilation, you should have the src_c folder containing the code for the Slovenian stemmer (alongside the others):

     stem_UTF_8_slovene.c
     stem_ISO_8859_2_slovene.c
     ...
  • Unpack the latest Sphinx source and copy all the files from the Snowball project into the libstemmer_c folder of the Sphinx source tree (excluding libstemmer.o and GNUmakefile).

  • Compile Sphinx:

     touch NEWS README AUTHORS ChangeLog
     autoreconf --force --install
     ./configure --with-libstemmer
     make
     make install
  • If everything went well, you should now have a working Slovenian stemmer for Sphinx. You just need to enable it in the Sphinx index configuration (on my Debian it is in /usr/local/etc/sphinx.conf):

     charset_type = utf-8
     morphology = libstemmer_slovene
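
For the encoding step above (converting the ISO-8859-2 file to UTF-8 and looking up the character codes), a small script can do both. This is only a sketch: the paths are the ones from the steps above, while the function names and the letter set čšžČŠŽ are assumptions made for illustration, so adjust them to whatever characters the algorithm file actually uses.

```python
from pathlib import Path

# Assumption: the non-ASCII Slovenian letters that matter for the stemmer.
SLOVENE_LETTERS = "čšžČŠŽ"


def latin2_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a text file from ISO-8859-2 to UTF-8."""
    text = src.read_text(encoding="iso-8859-2")
    dst.write_text(text, encoding="utf-8")


def code_points(letters: str = SLOVENE_LETTERS) -> dict:
    """Map each letter to its Unicode code point in hex, e.g. 'č' -> '10D'."""
    return {ch: format(ord(ch), "X") for ch in letters}


if __name__ == "__main__":
    # Path taken from the steps above; the conversion is skipped if the file is absent.
    src = Path("algorithms/slovene/stem_ISO_8859_2.sbl")
    if src.exists():
        latin2_to_utf8(src, src.with_name("stem_Unicode.sbl"))
    for ch, cp in code_points().items():
        print(ch, cp)
```

Run it from the root of the unpacked Snowball project. The printed hex values are the codes to look for when adapting the Unicode version of the .sbl file.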

Hope this helps someone. I had no previous experience with autoconf, so it took me a while to figure this out.

This Slovenian stemmer is not officially released at http://snowball.tartarus.org , but in my tests it works well enough for my project.


I'm not sure if this will do what you want, but I came across a tool called spelldump in the Sphinx documentation:

spelldump is one of the helper tools in the Sphinx package.

It is used to extract the contents of a dictionary file that uses the ispell or MySpell format, which can help build word lists for wordforms: all of the possible forms are pre-built for you.

http://sphinxsearch.com/docs/current.html#ref-spelldump

This requires a "dictionary file that uses the ispell or MySpell format". I found a link to a Slovenian dictionary, which may be suitable.
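
To give an idea of the target format: a Sphinx wordforms file holds one `form > normalized form` mapping per line. Here is a minimal sketch of producing such lines from a lemma-to-forms list. The sample words come from the question; the function name and the data structure are made up for illustration, and the sketch does not parse spelldump output.

```python
def wordforms_lines(forms_by_lemma):
    """Yield 'form > lemma' lines, the mapping format of a wordforms file."""
    for lemma, forms in sorted(forms_by_lemma.items()):
        for form in sorted(set(forms)):
            if form != lemma:  # the lemma itself needs no mapping
                yield f"{form} > {lemma}"


# Hypothetical hand-made sample; a real list would come from a dictionary.
SAMPLE = {"jabolko": ["jabolka", "jabolki", "jabolk", "jabolko"]}

if __name__ == "__main__":
    for line in wordforms_lines(SAMPLE):
        print(line)
```

Redirect the output to a file and point the wordforms option of the index at it. With this mapping, a search for "jabolka" and a document containing "jabolki" both normalize to "jabolko".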

Good luck


Source: https://habr.com/ru/post/1441137/

