How to check if an XML file contains sequential nodes?

I have an XML file similar to this:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="jats-html.xsl"?>
<!--<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with OASIS Tables v1.0 20120330//EN" "JATS-journalpublishing-oasis-article1.dtd">-->
<article article-type="proceedings" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://www.niso.org/standards/z39-96/ns/oasis-exchange/table">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id" />
<journal-title-group>
<journal-title>Eleventh &#x0026; Tenth International Conference on Correlation Optics</journal-title>
</journal-title-group>
<issn pub-type="epub">0277-786X</issn>
<publisher>
<publisher-name>Springer</publisher-name>
</publisher>
</journal-meta>
<fig-count count="0" />
<table-count count="0" />
<equation-count count="0" />
</front>
<body>
<sec id="s1">
<label>a.</label>
<title>INTRODUCTION</title>
<p>One of approaches of solving<xref ref-type="bibr" rid="ref11">[11]</xref>, <xref ref-type="bibr" rid="ref13">[13]</xref>, <xref ref-type="bibr" rid="ref8">[8]</xref> the problem <xref ref-type="bibr" rid="ref1">[1]</xref>, <xref ref-type="bibr" rid="ref5">[2]</xref>, <xref ref-type="bibr" rid="ref6">[6]</xref> <xref ref-type="bibr" rid="ref7">[6]</xref> of light propagation in scattering media is the method of Monte Carlo statistical simulation<sup><xref ref-type="bibr" rid="c1">1</xref><xref ref-type="bibr" rid="c5">5</xref></sup>. It is a set of techniques that allow us to find the necessary solutions by repetitive random sampling. Estimates of the unknown quantities are statistical means.</p>
<p>For the case of radiation transport in scattering <xref ref-type="bibr" rid="ref6">6</xref> <xref ref-type="bibr" rid="ref8">8</xref> <xref ref-type="bibr" rid="ref9">9</xref> <xref ref-type="bibr" rid="ref10">10</xref> medium Monte Carlo method consists in repeated calculation of the trajectory <xref ref-type="bibr" rid="ref7">6</xref> <xref ref-type="bibr" rid="ref7">7</xref> <xref ref-type="bibr" rid="ref8">8</xref> <xref ref-type="bibr" rid="ref9">[9]</xref> of a photon in a medium based on defined environment parameters. Application of Monte Carlo method is based on the use of macroscopic optical properties of the medium which are considered homogeneous within small volumes of tissue. Models that are based on this method can be divided into two types: models that take into account the polarization of the radiation, and models that ignore it.</p>
<p>Simulation that is based on the previous models usually discards the details of the radiation energy distribution within a single scattering particle. This disadvantage can be ruled out (in the case of scattering particles whose size exceeds the wavelength) by using another method - reverse ray tracing. This method is like the one mentioned before on is based on passing a large number of photons through a medium that is simulated. The difference is that now each scattering particle has a certain geometric topology and scattering is now calculated using the Fresnel equations. The disadvantage of this method is that it can give reliable results only if the particle size is much greater than the wavelength (at least an order of magnitude).</p>
</sec>
</body>
</article>

in which there are reference nodes of the form <xref ref-type="bibr" rid="ref...">...</xref>. How can I find out whether the file contains 3 or more consecutive xref nodes (separated by a comma and a space, or by just a space) and output them to a txt file?

I can do a regular expression search, for example (?:<xref ref-type="bibr" rid="ref\d+">\[\d+\]</xref>,?\s){2,}<xref ref-type="bibr" rid="ref\d+">\[\d+\]</xref>, which finds 3 or more xref nodes separated by a comma and a space or by just a space, but it does not check that their identifiers are sequential. How can I do that?

+4
3 answers

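The code below selects the elements that contain 3 or more xref nodes, then builds a regular expression to check that sequentially numbered references follow one another, separated only by a space or by a comma and a space.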

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Text.RegularExpressions;


public static void Main(string[] args)
{
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.Load("article.xml");

    //only selects <p> that already have 3 or more refs. No need to check paragraphs that don't even have enough refs
    XmlNodeList nodes = doc.DocumentElement.SelectNodes("//*[count(xref[@ref-type='bibr' and starts-with(@rid,'ref')])>2]");

    List<string> results = new List<string>();

    //Foreach <p>
    foreach (XmlNode x in nodes)
    {
        XmlNodeList xrefs = x.SelectNodes(".//xref[@ref-type='bibr' and starts-with(@rid,'ref')]");
        List<StartEnd> startEndOfEachTag = new List<StartEnd>(); // we mark the start and end of each ref.
        string temp = x.OuterXml; //the paragraph we're checking

        //finds the start and end of each xref tag inside the paragraph
        int searchFrom = 0; //search forward so that identical xrefs get distinct positions
        foreach (XmlNode xN in xrefs){
            int start = temp.IndexOf(xN.OuterXml, searchFrom);
            StartEnd se = new StartEnd(start, start + xN.OuterXml.Length);
            startEndOfEachTag.Add(se);
            searchFrom = se.end;
        }

        /* This comment shows the regex command used and how we build the regular expression we are checking with.
        string regexTester = Regex.Escape("<xref ref-type=\"bibr\" rid=\"ref2\">2</xref>")+"([ ]|(, ))" + Regex.Escape("<xref ref-type=\"bibr\" rid=\"ref3\">3</xref>");
        Match matchTemp = Regex.Match("<xref ref-type=\"bibr\" rid=\"ref2\">2</xref> <xref ref-type=\"bibr\" rid=\"ref3\">3</xref>", regexTester);
        Console.WriteLine(matchTemp.Value);*/

        //we go through all the xrefs
        for (int i=0; i<xrefs.Count; i++)
        {
            int newIterator = i; //This iterator prevents us from creating duplicates.
            string regCompare = Regex.Escape(xrefs[i].OuterXml); // The start xref

            int count = 1; //we got one xref to start with we need at least 3
            string tempRes = ""; //the string we store the result in

            int consecutive = Int32.Parse(xrefs[i].Attributes["rid"].Value.Substring(3));

            for (int j=i+1; j<xrefs.Count; j++) //we check with the other xrefs to see if they follow immediately after.
            {
                if(consecutive == Int32.Parse(xrefs[j].Attributes["rid"].Value.Substring(3)) - 1)
                {
                    consecutive++;
                }
                else { break; }

                regCompare += "([ ]|(, ))" + Regex.Escape(xrefs[j].OuterXml); //we check that the xref comes exactly after a space or a comma and space

                Match matchReg;

                try
                {
                    matchReg = Regex.Match(temp.Substring(startEndOfEachTag[i].start, startEndOfEachTag[j].end - startEndOfEachTag[i].start),
                        regCompare); //we get the result
                }
                catch
                {
                    i = j; // we failed and i should start from here now.
                    break;
                }

                if (matchReg.Success){
                    count++; //it was a success so we increment the number of xrefs we matched
                    tempRes = matchReg.Value; // we add it to out temporary result.
                    newIterator = j; //update where i should start from next time.
                }
                else {
                    i = j; // we failed and i should start from here now.
                    break;
                }
            }
            i = newIterator;
            if (count > 2)
            {
                results.Add(tempRes); 
            }
        }
    }
    Console.WriteLine("Results: ");
    foreach(string s in results)
    {
        Console.WriteLine(s+"\n");
    }

    Console.ReadKey();
}

class StartEnd
{
    public int start=-1;
    public int end = -1;

    public StartEnd(int start, int end)
    {
        this.start = start;
        this.end = end;
    }
}
+2

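You can do the first part with XPath. The XPath below selects the parent of every xref node whose ref-type is bibr and whose rid starts with ref. The loop then keeps only the parents that contain 3 or more such nodes. Note that this does not check whether the references are sequential.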

public static void Main(string[] args)
{
    XmlDocument doc = new XmlDocument();
    doc.Load("article.xml");

    XmlNodeList nodes = doc.DocumentElement.SelectNodes("//xref[@ref-type='bibr' and starts-with(@rid,'ref')]/parent::*");

    foreach(XmlNode x in nodes)
    {
        //relative path: "//xref" would search the whole document instead of the children of x
        XmlNodeList temp = x.SelectNodes("xref[@ref-type='bibr' and starts-with(@rid,'ref')]");
        //we only select those that have 3 or more references.
        if (temp.Count >= 3)
        {
            Console.WriteLine(x.InnerText);
        }
    }

    Console.ReadKey();

}

Alternatively, a single XPath expression can do the counting, so no extra check is needed in the loop.

public static void Main(string[] args)
{
    XmlDocument doc = new XmlDocument();
    doc.Load("article.xml");

    XmlNodeList nodes = doc.DocumentElement.SelectNodes("//*[count(xref[@ref-type='bibr' and starts-with(@rid,'ref')])>2]");

    foreach(XmlNode x in nodes){
        Console.WriteLine(x.InnerText);
    }

    Console.ReadKey();

}
+1

Here is a C# solution that reads the XML and looks for consecutive xref nodes separated by ", " or " ".

  // Requires: using System; using System.Linq; using System.Text;
  static void Main(string[] args)
  {
     using (var xmlStream = System.Reflection.Assembly.GetExecutingAssembly().GetManifestResourceStream("ConsoleApp1.XMLFile1.xml"))
     {
        int state = 0; // 0 = Look for xref; 1 = look for separator
        string[] simpleSeparators = { " ", ", " };
        string rid = "0";
        StringBuilder nodeText = new StringBuilder();
        string[] consecutiveNodes = new string[3];

        System.Xml.XmlReaderSettings settings = new System.Xml.XmlReaderSettings();
        settings.DtdProcessing = System.Xml.DtdProcessing.Ignore;
        using (var reader = System.Xml.XmlReader.Create(xmlStream, settings))
        {
           while (reader.Read())
           {
              if (reader.IsStartElement("xref"))
              {
                 nodeText.Append("<xref");
                 if (reader.HasAttributes)
                 {
                    while (reader.MoveToNextAttribute())
                       nodeText.AppendFormat(" {0}=\"{1}\"", reader.Name, reader.Value);
                 }
                 nodeText.Append(">");
                 string nextRid = reader.GetAttribute("rid");
                 switch (state)
                 {
                    case 0:
                       break;
                    case 2:
                    case 4:
                       if (Math.Abs(GetIndex(nextRid) - GetIndex(rid)) > 1)
                          state = 0;
                       break;
                 }
                 state++;
                 rid = nextRid;
              }
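              // A lone " " between two elements is reported as Whitespace, not Text
              else if (reader.NodeType == System.Xml.XmlNodeType.Text
                 || reader.NodeType == System.Xml.XmlNodeType.Whitespace
                 || reader.NodeType == System.Xml.XmlNodeType.SignificantWhitespace)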
              {
                 if (state > 0)
                    nodeText.Append(reader.Value);
                 if ((state % 2 == 1) && simpleSeparators.Contains(reader.Value))
                       state++;
              }
              else if ((reader.NodeType == System.Xml.XmlNodeType.EndElement) && (state > 0))
              {
                 nodeText.AppendFormat("</{0}>", reader.Name);
                 consecutiveNodes[state / 2] = nodeText.ToString();
                 nodeText.Clear();
                 if (state > 3)
                 {
                    Console.WriteLine("{0}{1}{2}", consecutiveNodes[0], consecutiveNodes[1], consecutiveNodes[2]);
                    state = 0;
                 }
              }
              else if (reader.IsStartElement())
              {
                 nodeText.Clear();
                 state = 0;
              }
           }
        }
     }
  }

  static int GetIndex(string rid)
  {
     int start = rid.Length;
     while ((start > 0) && Char.IsDigit(rid, --start)) ;

     start++;
     if (start < rid.Length)
        return int.Parse(rid.Substring(start));
     return 0;
  }

Output:

<xref ref-type="bibr" rid="ref2">[2]</xref>, <xref ref-type="bibr" rid="ref3">[3]</xref>, <xref ref-type="bibr" rid="ref4">[4]</xref>
<xref ref-type="bibr" rid="rid6">6</xref><xref ref-type="bibr" rid="rid6">9</xref><xref ref-type="bibr" rid="rid6">10</xref>

But not:

<xref ref-type="bibr" rid="ref11">[11]</xref>, <xref ref-type="bibr" rid="ref13">[13]</xref>, <xref ref-type="bibr" rid="ref8">[8]</xref>

ref11, ref13 and ref8 are not sequential, so they are skipped.
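
For comparison, the same check could be sketched with LINQ to XML instead of XmlReader. This is only a sketch under stated assumptions, not a tested drop-in: the class name SequentialXrefFinder, the helpers RefNumber and IsSeparator, and the output file name sequential-xrefs.txt are made up for illustration. "Sequential" is taken to mean that each rid number is exactly one more than the previous one, and the separator has to be exactly " " or ", ".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class SequentialXrefFinder
{
    // Hypothetical helper: returns N for <xref ref-type="bibr" rid="refN">, otherwise null.
    static int? RefNumber(XNode node)
    {
        var e = node as XElement;
        if (e == null || e.Name.LocalName != "xref" || (string)e.Attribute("ref-type") != "bibr")
            return null;
        Match m = Regex.Match((string)e.Attribute("rid") ?? "", @"^ref(\d+)$");
        return m.Success ? int.Parse(m.Groups[1].Value) : (int?)null;
    }

    // Hypothetical helper: true for a text node that is exactly " " or ", ".
    static bool IsSeparator(XNode node)
    {
        var t = node as XText;
        return t != null && (t.Value == " " || t.Value == ", ");
    }

    // Stores children[start..lastIndex] as one string if the run is long enough.
    static void AddIfLongEnough(List<string> results, List<XNode> children,
                                int start, int lastIndex, int count, int minLength)
    {
        if (count < minLength) return;
        results.Add(string.Concat(children.Skip(start).Take(lastIndex - start + 1)
                                          .Select(c => c.ToString(SaveOptions.DisableFormatting))));
    }

    // Returns every run of minLength or more xrefs with increasing rid numbers.
    public static List<string> FindRuns(XDocument doc, int minLength = 3)
    {
        var results = new List<string>();
        foreach (XElement parent in doc.Descendants())
        {
            List<XNode> children = parent.Nodes().ToList();
            int start = -1, lastIndex = -1, last = 0, count = 0;
            for (int i = 0; i < children.Count; i++)
            {
                int? n = RefNumber(children[i]);
                if (n == null) continue;
                // The run continues only if exactly one separator sits between
                // the previous xref and this one, and the number goes up by 1.
                bool continues = count > 0 && i == lastIndex + 2
                                 && IsSeparator(children[i - 1]) && n.Value == last + 1;
                if (!continues)
                {
                    AddIfLongEnough(results, children, start, lastIndex, count, minLength);
                    start = i;
                    count = 0;
                }
                count++;
                last = n.Value;
                lastIndex = i;
            }
            AddIfLongEnough(results, children, start, lastIndex, count, minLength);
        }
        return results;
    }

    public static void Main()
    {
        // PreserveWhitespace keeps the single-space text nodes between xrefs.
        XDocument doc = XDocument.Load("article.xml", LoadOptions.PreserveWhitespace);
        File.WriteAllLines("sequential-xrefs.txt", FindRuns(doc));
    }
}
```

Loading with LoadOptions.PreserveWhitespace matters here: without it, whitespace-only text between two xref elements is dropped and the " " separator can no longer be detected.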

+1

Source: https://habr.com/ru/post/1689532/

