Does Lucene support Unicode?

I am building full-text search for my site, which is written in ASP.NET MVC with a MySQL database. The site is in a non-English language. I started out using Lucene as the text search engine, but I can't find out whether it supports Unicode.

Does anyone have any info on whether Lucene supports Unicode? I do not want an unpleasant surprise.

Links to beginner articles on implementing Lucene.NET would also be appreciated.

+4
3 answers

Yes, it fully supports Unicode.
For analysis, however, you must explicitly configure the appropriate stemmers and the correct stop words. As for a sample, here is an excerpt from our latest project:

    directory = new RAMDirectory();
    analyzer = new StandardAnalyzer(version, new Hashtable());
    var indexWriter = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    using (var session = sessionFactory.OpenStatelessSession())
    {
        // Index every Organization: Id and type name are stored, FullName is analyzed for search.
        organizations = session.CreateCriteria(typeof(Organization)).List<Organization>();
        foreach (var organization in organizations)
        {
            var document = new Document();
            document.Add(new Field("Id", organization.ID.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
            document.Add(new Field("FullName", organization.FullName, Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));
            document.Add(new Field("ObjectTypeInvariantName", typeof(Organization).FullName, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
            indexWriter.AddDocument(document);
        }

        // Collect the mapped properties of Order that are neither collections nor entities.
        var persistentType = typeof(Order);
        var classMetadata = DbContext.SessionFactory.GetClassMetadata(persistentType);
        var properties = new List<PropertyInfo>();
        for (int i = 0; i < classMetadata.PropertyTypes.Length; i++)
        {
            var propertyType = classMetadata.PropertyTypes[i];
            if (propertyType.IsCollectionType || propertyType.IsEntityType)
                continue;
            properties.Add(typeof(Order).GetProperty(classMetadata.PropertyNames[i]));
        }

        // Index every Order, adding each non-null simple property as an analyzed field.
        orders = session.CreateCriteria(typeof(Order)).List<Order>();
        var idProperty = typeof(Order).GetProperty(classMetadata.IdentifierPropertyName);
        foreach (var order in orders)
        {
            var document = new Document();
            document.Add(new Field("Id", idProperty.GetValue(order, null).ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
            document.Add(new Field("ObjectTypeInvariantName", typeof(Order).FullName, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
            foreach (var property in properties)
            {
                var value = property.GetValue(order, null);
                if (value != null)
                {
                    document.Add(new Field(property.Name, value.ToString(), Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));
                }
            }
            indexWriter.AddDocument(document);
        }

        indexWriter.Optimize(true);
        indexWriter.Commit();
        return indexWriter.GetReader();
    }

I query the Organization objects from NHibernate and put them into Lucene.NET.

Here is a simple search:

    var searchValue = textEdit1.Text;
    var parser = new QueryParser(version, "FullName", analyzer);
    parser.SetLocale(new CultureInfo("ru-RU"));
    Query query = parser.Parse(searchValue);
    var indexSearcher = new IndexSearcher(directory, true);
    var docs = indexSearcher.Search(query, 10);
    lblSearchTotal.Text = string.Format(totalPattern, docs.totalHits, organizations.Count() + orders.Count);
    resultPanel.Controls.Clear();
    foreach (var found in docs.scoreDocs)
    {
        var document = indexSearcher.Doc(found.doc);
        var objectId = document.Get("Id");
        var objectType = document.Get("ObjectTypeInvariantName");
        if (resultPanel.Controls.Count > 0)
        {
            var labelSeparator = CreateSeparatorLabelControl();
            resultPanel.Controls.Add(labelSeparator);
        }
        var labelCard = CreateFoundLabelControl();
        resultPanel.Controls.Add(labelCard);
        var organization = organizations.Where(o => o.ID.ToString() == objectId).FirstOrDefault();
        if (organization != null)
        {
            labelCard.Text = string.Format("<b>{0}</b></br>{1}", organization.AccountNumber, organization.FullName);
            labelCard.Tag = organization;
            //labels[count].Text = string.Format("<b>{0}</b></br>{1}", organization.AccountNumber, organization.FullName);
            //labels[count].Visible = true;
        }
        else
        {
            labelCard.Text = string.Format("   '{0}'   '{1}'", objectType, objectId);
            labelCard.Tag = mainForm.GetObject(objectType, objectId);
        }
        labelCard.Visible = true;
        //count++;
    }
+8

Yes, Lucene supports Unicode, because it stores strings as UTF-8. From the file format documentation:

http://lucene.apache.org/java/3_0_3/fileformats.html

Chars

Lucene writes Unicode character sequences as UTF-8 encoded bytes.

String

Lucene writes strings as UTF-8 encoded bytes. First, the length in bytes is written as a VInt, followed by the bytes.

String → VInt, Chars
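
To make the quoted layout concrete, here is a minimal C# sketch of it, not Lucene's own code. The class and method names (`LuceneStringSketch`, `EncodeVInt`, `EncodeString`) are made up for illustration, and the VInt scheme (seven data bits per byte, low-order group first, high bit set on every byte except the last) is my reading of the same file format document.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; this is not part of Lucene.NET.
public static class LuceneStringSketch
{
    // Encodes a non-negative int as a VInt: 7 data bits per byte, low-order
    // group first, with the high bit set on every byte except the last one.
    public static byte[] EncodeVInt(int value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        var bytes = new List<byte>();
        while ((value & ~0x7F) != 0)
        {
            bytes.Add((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        bytes.Add((byte)value);
        return bytes.ToArray();
    }

    // Encodes a string as VInt(length in bytes) followed by its UTF-8 bytes.
    public static byte[] EncodeString(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        byte[] prefix = EncodeVInt(utf8.Length);
        var result = new byte[prefix.Length + utf8.Length];
        Buffer.BlockCopy(prefix, 0, result, 0, prefix.Length);
        Buffer.BlockCopy(utf8, 0, result, prefix.Length, utf8.Length);
        return result;
    }

    public static void Main()
    {
        // Six Cyrillic letters take two bytes each in UTF-8, so the
        // length prefix is 12 and the whole record is 13 bytes long.
        byte[] encoded = EncodeString("Привет");
        Console.WriteLine(BitConverter.ToString(encoded));
    }
}
```

The point of the sketch is that the length prefix counts bytes, not characters, so non-ASCII text round-trips without any special handling.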

+5

Lucene supports Unicode, but there are limitations. For example, some document readers do not support Unicode. In addition, Lucene does things such as stemming, for example reducing plural words to their singular form. When you use a non-English language, some of these features are lost.
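
As a rough illustration of that last point, here is a toy C# sketch. The `NaiveEnglishStem` rule is invented for this example and is far cruder than any real stemmer; it only shows how an English-oriented rule folds English plurals together while leaving words from another language untouched.

```csharp
using System;

// Toy example only: this is NOT how Lucene stems words.
public static class StemmingSketch
{
    // Hypothetical English-only rule: strip a trailing "s" from words
    // longer than three letters.
    public static string NaiveEnglishStem(string word)
    {
        if (word.Length > 3 && word.EndsWith("s", StringComparison.Ordinal))
            return word.Substring(0, word.Length - 1);
        return word;
    }

    public static void Main()
    {
        // English singular and plural collapse to the same term.
        Console.WriteLine(NaiveEnglishStem("orders") == NaiveEnglishStem("order")); // True

        // Russian "заказ" (order) and "заказы" (orders) stay distinct,
        // so a search for one form would not match the other.
        Console.WriteLine(NaiveEnglishStem("заказы") == NaiveEnglishStem("заказ")); // False
    }
}
```

This is why an analyzer and stemmer that match the language of the content matter more than Unicode support itself.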

+2

Source: https://habr.com/ru/post/1334469/