Slovenian stemmer for Sphinx

Question


I am looking for a stemming algorithm for the Slovenian language that I can use with Sphinx search.

What I'm trying to achieve: when searching for "jabolka", I also want results for documents containing "jabolko", "jabolki", "jabolk", etc.

I found several links mentioning that a Slovenian stemmer exists, but I cannot find where to download it; it does not even seem to be sold anywhere.

Another option I came across is the wordforms option in the Sphinx configuration ( http://sphinxsearch.com/docs/manual-0.9.9.html#conf-wordforms ), but building my own dictionary would be too complicated, so I'm wondering whether any public dictionaries are already available.

If a Slovenian stemmer is not available, can anyone suggest a different approach to achieving similar search results?

+3

php search full-text-search sphinx stemming

Kovinet Jan 03 '12 at 14:50


2 answers

I'm not sure if this will do what you want, but I came across a tool called spelldump in the Sphinx documentation:

spelldump is one of the helper tools within the Sphinx package.
It is used to extract the contents of a dictionary file that uses the ispell or MySpell format, which can help build word lists for wordforms: all of the possible forms are pre-built for you.
http://sphinxsearch.com/docs/current.html#ref-spelldump

This requires a "dictionary file that uses the ispell or MySpell format". I found a link to a Slovenian dictionary, which may be suitable.
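For what it's worth, the wordforms file itself is plain text with one `form > normalized form` mapping per line, so once you have a list of forms (from spelldump or anywhere else), producing it is straightforward. A minimal sketch in Python, assuming the forms are already grouped by base word; the function name and the sample data are made up for illustration:

```python
def build_wordforms(groups):
    """Return wordforms lines ("form > base") for a {base: [forms]} mapping."""
    lines = []
    for base, forms in sorted(groups.items()):
        for form in sorted(set(forms)):
            if form != base:  # mapping a word to itself is redundant
                lines.append(f"{form} > {base}")
    return lines


# Hypothetical sample data, taken from the example in the question.
groups = {"jabolko": ["jabolka", "jabolki", "jabolk", "jabolko"]}

wordforms = build_wordforms(groups)
print("\n".join(wordforms))
```

The resulting lines would go into a UTF-8 text file that the wordforms option of the index points at.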

Good luck

+1

Colin Pickard Jan 11 '12 at 17:34


Kovinet · Accepted Answer · 2012-03-05T14:04:53+0000

I managed to compile the Slovenian stemmer with the following steps:

1. Download http://snowball.tartarus.org/dist/snowball_code.tgz (the Snowball source code) and unpack it.
2. Download the Slovenian algorithm from http://snowball.tartarus.org/archives/snowball-discuss/0725.html and save it in the project unpacked in step 1, in the /algorithms/slovene folder. The file name should be stem_ISO_8859_2.sbl.
3. The algorithm is in ISO 8859-2 encoding, so I converted it to UTF-8 and saved it as stem_Unicode.sbl (you need to find the Unicode character codes for Slovenian special characters like ČŠŽĆ).
4. Edit both .txt files in the /libstemmer folder and add an entry for Slovenian:
   ```
   slovene UTF_8,ISO_8859_2 slovene,sl,slv
   ```
5. Modify /GNUmakefile and add slovene (once in the list of languages for UTF-8 and once in ISO_8859_2_algorithms).
6. Go to the /libstemmer folder and run:

       ./mkmodules.pl modules.h src_c modules.txt ../mkinc.mak
       ./mkmodules.pl modules_utf8.h src_c modules_utf8.txt ../mkinc_utf8.mak

   This will create the files needed for compilation later.
7. Run make (from the root of the unpacked files).
8. If there were no errors during compilation, you should have the /src_c folder with the code for the Slovenian stemmer in it (next to the others):
   ```
   stem_UTF_8_slovene.c
   stem_ISO_8859_2_slovene.c
   ...
   ```
9. Unpack the latest Sphinx source and copy all the files from the Snowball project to the sphinx/libstemmer_c folder (excluding libstemmer.o and GNUmakefile).
10. Compile Sphinx:

        touch NEWS README AUTHORS ChangeLog
        autoreconf --force --install
        ./configure --with-libstemmer
        make
        make install

If everything went well, you should now have a working Slovene stemmer for Sphinx. You just need to enable it in the Sphinx index configuration (on my Debian it is in /usr/local/etc/sphinx.conf):
```
charset_type = utf-8
morphology = libstemmer_slovene
```
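To look up the character codes needed for the ISO-to-UTF-8 conversion of the algorithm file described above, something like the following Python sketch can help. It prints the Unicode code point of each special character and its byte value in ISO 8859-2, assuming (as the file name stem_ISO_8859_2.sbl suggests) that this is the encoding of the original file. The helper name is made up, and how the codes are written inside the .sbl file is not shown here; the existing Snowball algorithm sources are the place to check for that syntax.

```python
def char_codes(chars, legacy_encoding="iso-8859-2"):
    """Map each character to (Unicode code point in hex, legacy byte in hex)."""
    table = {}
    for ch in chars:
        code_point = format(ord(ch), "04X")
        legacy_byte = ch.encode(legacy_encoding).hex().upper()
        table[ch] = (code_point, legacy_byte)
    return table


# Upper- and lower-case versions of the characters mentioned in the steps.
codes = char_codes("ČŠŽĆčšžć")
for ch, (code_point, legacy_byte) in codes.items():
    print(f"{ch}  U+{code_point}  ISO 8859-2 byte 0x{legacy_byte}")
```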

Hope this helps someone. I had no previous experience with autoconf, so it took me a while to figure this out.

This Slovene stemmer is not officially released at http://snowball.tartarus.org, but in my tests it works well enough for my project.
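A simple way to run this kind of test is to check that the inflected forms of a word all reduce to one stem. The Python sketch below does that for any stemming function passed in. Note that `toy_stem` is only a made-up stand-in so the example runs on its own (it drops one trailing vowel); it is not the Snowball Slovene algorithm, and in a real test the compiled stemmer would be called instead.

```python
def toy_stem(word):
    """Stand-in stemmer (NOT the Snowball algorithm): drop one trailing vowel."""
    if len(word) > 3 and word[-1] in "aeiou":
        return word[:-1]
    return word


def stems_for(words, stem):
    """Group words by the stem that the given function produces."""
    groups = {}
    for word in words:
        groups.setdefault(stem(word), []).append(word)
    return groups


# Word forms from the example in the question.
forms = ["jabolka", "jabolko", "jabolki", "jabolk"]
groups = stems_for(forms, toy_stem)
print(groups)
print("all forms share one stem:", len(groups) == 1)
```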
