I am trying to index Wikipedia dumps. My SAX parser creates Article objects from the XML, keeping only the fields I care about, and submits them to my ArticleSink, which creates Lucene documents.
I want to filter out special/meta pages, such as those whose titles start with Category: or Wikipedia:, so I made an array of these prefixes and test each page's title against that array in ArticleSink using article.getTitle().startsWith(prefix). In English everything works fine: I get a Lucene index with all pages except those with the matching prefixes.
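The check described above can be sketched as follows. This is a minimal illustration, not my actual ArticleSink: the class name PrefixFilter, the method isIgnored, and the shortened prefix list are made up for the example, and the accented letters are written as Unicode escapes so the source file's own encoding cannot interfere.

```java
final class PrefixFilter {
    // Hypothetical subset of the ignored prefixes; \u00e9 is the precomposed "e acute".
    static final String[] IGNORED_PREFIXES = {
        "Category:", "Wikipedia:", "Cat\u00e9gorie:", "Wikip\u00e9dia:"
    };

    // Returns true when the title starts with one of the ignored prefixes.
    static boolean isIgnored(String title) {
        for (String prefix : IGNORED_PREFIXES) {
            if (title.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isIgnored("Category:Rock music"));            // meta page
        System.out.println(isIgnored("Cat\u00e9gorie:Rock par ville"));  // French meta page
        System.out.println(isIgnored("Rock music"));                     // regular article
    }
}
```

Note that startsWith compares UTF-16 code units exactly, so the title must contain exactly the same characters as the prefix for the page to be filtered.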
In French, prefixes without accents also work (i.e., the relevant pages are filtered). Some of the accented prefixes don't work at all (for example Catégorie:), and some work most of the time but fail on some pages (for example Wikipédia:), yet I cannot see any difference between the corresponding lines (viewed with less).
I can't check every difference in the file because of its size (5 GB), but it looks like valid UTF-8 XML. If I take part of the file using grep or head, the accents are correct (even on the offending pages, <title>Catégorie:something</title> is displayed correctly by grep). On the other hand, when I build a smaller wiki XML file by cutting the original with head and tail, the same page (here Catégorie:Rock par ville) is filtered in the small file but not in the original...
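Since less and grep render the text and can hide differences between strings that look the same, one way to compare a title with a prefix is to print the code points the Java side actually receives. The sketch below is only a diagnostic idea; TitleDump and codePoints are hypothetical names. The example in main uses a precomposed versus a decomposed "é" purely as an illustration of two look-alike strings. I do not know whether that is what happens in the dump.

```java
final class TitleDump {
    // Renders each code point as U+XXXX so that look-alike strings can be compared.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "Cat\u00e9";     // single precomposed character
        String decomposed = "Cate\u0301";  // plain e followed by a combining acute accent
        System.out.println(codePoints(composed));
        System.out.println(codePoints(decomposed));
        // The two render identically but are different strings for startsWith.
        System.out.println(decomposed.startsWith(composed));
    }
}
```

Calling codePoints on article.getTitle() for a page that escapes the filter, and on the prefix it should have matched, would show where the two strings diverge.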
Any ideas?
Alternatives I tried:
Opening the file (the commented-out lines were tried without success *):
FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8" );
//(custom function opening the stream,
//reading it as UTF-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource( fis ); is.setEncoding("UTF-8");
parser.parse(fis, handler);
Filtered Prefixes:
ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
"Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:",
"Catégorie:", "Modèle:", "Wikipédia:",
"Image:", "Portail:", "Fichier:", "Aide:", "Projet:"};
* ERRATUM
Actually, my mistake: this alternative, which I had already tried, does work. I had checked the wrong index:
InputSource is = new InputSource( fis );
is.setEncoding("UTF-8");
parser.parse(is, handler); // pass the InputSource, not the raw stream, so the encoding is applied
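For completeness, here is a self-contained sketch of that approach. It is not my real code: TitleParseSketch, parseTitles, and the tiny in-memory XML are made up for the example, and the handler is a simplified stand-in that only collects <title> text instead of building Article objects. It assumes the default, non-namespace-aware SAXParserFactory, so element names arrive in qName.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class TitleParseSketch {
    // Parses the stream as UTF-8 XML and returns the text of every <title> element.
    static List<String> parseTitles(InputStream in) throws Exception {
        List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <title>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one text node in several chunks, so append rather than assign.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    titles.add(text.toString());
                    text = null;
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        InputSource is = new InputSource(in);
        is.setEncoding("UTF-8");   // declare the encoding explicitly
        parser.parse(is, handler); // parse the InputSource, not the raw stream
        return titles;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<page><title>Cat\u00e9gorie:Rock par ville</title></page>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(parseTitles(new ByteArrayInputStream(xml)));
    }
}
```

For the real dump, the ByteArrayInputStream would be replaced by the FileInputStream opened on xmlFileName.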