Using UIMA to extract text from an XML file

Question

Using UIMA to extract text from an XML file

I am creating a text extractor for XML using UIMA. Since I am new to the UIMA framework, I want to know how to do this.

I understand that UIMA can annotate specific parts of a file, but how can I extract information efficiently? Any help is appreciated.

Thanks Jatin

+4

xml uima

jatinpreet Apr 3 '14 at 11:40

source share

2 answers

Here's a picker to get you started:

import static com.google.common.base.Preconditions.checkArgument;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;

import org.apache.uima.UimaContext;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.fit.descriptor.TypeCapability;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.JDOMException;
import org.jdom2.Text;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;



@TypeCapability(outputs = "xxx")
public class XmlCollectionReader extends JCasCollectionReader_ImplBase {
    private static Logger LOG = LoggerFactory.getLogger(XmlCollectionReader.class);

    private SAXBuilder builder;
    private XMLOutputter xo;
    private XPathExpression<Object> sentenceXPath;

    @Override
    public void initialize(UimaContext context) throws ResourceInitializationException {
        super.initialize(context);
        try {
            File corpusDir = new File(inputDir);
            checkArgument(corpusDir.exists());
            fileIterator = DirectoryIterator.get(directoryIterator, corpusDir, "xml", false);
            builder = new SAXBuilder();
            xo = new XMLOutputter();
            xo.setFormat(Format.getRawFormat());
            sentenceXPath = XPathFactory.instance().compile("//S");
        } catch (Exception e) {
            throw new ResourceInitializationException(
                    ResourceInitializationException.NO_RESOURCE_FOR_PARAMETERS,
                    new Object[] { inputDir });
        }
    }

    public void getNext(JCas jcas) throws IOException, CollectionException {

        File file = fileIterator.next();
        try {
            LOG.debug("reading {}", file.getName());
            Document doc = builder.build(new FileInputStream(file));
            Element rootNode = doc.getRootElement();

            String title = xo.outputString(rootNode.getChild("Title").getContent());

            for (Object sentence : sentenceXPath.evaluate(rootNode)) {
                Element sentenceE = (Element) sentence;
                    ...
                }
            }

            jcas.setDocumentText(...);

        } catch (JDOMException e) {
            throw new CollectionException(e);
        }
    }
}

+1

Renaud Apr 30 '14 at 15:10

source share

Peter Kluegl · Accepted Answer · 2014-04-04T09:07:29+0000

In the limited perspective of a UIMA Ruta developer , I use the HtmlAnnotator UIMA Ruta for these use cases. This is certainly not the most effective approach. The analysis engine will not use separate types for elements, since it knows only the most common html tags, but if necessary, I convert to a predefined type system in UIMA Ruta. The backend uses htmlparser .

Using UIMA to extract text from an XML file

More articles: