Regex: how can I improve this method for getting src and alt from an image?

Getting src or alt separately is no problem, but how can I capture both at the same time as named groups?

Keep in mind that alt can appear to the left or to the right of src.

I was in a hurry, so I found a quick solution that uses three named groups: one for src and two for the text on either side of it, where alt may be. I know this can be done much better.

private void GetFirstImage(string newHtml, out string imgstring, out string imgalt)
{
    imgalt = "";
    imgstring = "";

    string pattern = "(?<=<img(?<name1>\\s+[^>]*?)src=(?<q>['\"]))(?<url>.+?)(?=\\k<q>)(?<name2>.+?)\\s*\\>";

    try
    {
        // if there is an image
        if (Regex.IsMatch(newHtml, pattern))
        {
            Regex r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

            imgstring = r.Match(newHtml).Result("${url}");
            string tempalt = "", tempalt2;
            tempalt = r.Match(newHtml).Result("${name1}");
            tempalt2 = r.Match(newHtml).Result("${name2}");

            // we now have the image path and whatever appears to its left and right inside <img>

            try
            {
                pattern = "alt=(?<q>['\"])(?<alt>.+?)(?=\\k<q>)";

                // if there is something non-empty to the left of the path
                if(!String.IsNullOrEmpty(tempalt.Trim()))
                {
                    r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

                    // if it matches the pattern used to find the alt
                    if (Regex.IsMatch(tempalt, pattern))
                    {

                        imgalt = r.Match(tempalt).Result("${alt}");

                    }
                }
                // if the alt was not found and there is something to the right
                if(String.IsNullOrEmpty(imgalt) && !String.IsNullOrEmpty(tempalt2.Trim()))
                {

                    r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

                    // if it matches the alt pattern
                    if (Regex.IsMatch(tempalt2, pattern))
                    {

                        imgalt = r.Match(tempalt2).Result("${alt}");

                    }

                }

            }
            catch{ }

        }

    }
    catch{}

}
1 answer

Simple: do not use regex. Use a DOM parser instead: XmlDocument for XHTML, or the HTML Agility Pack for (non-X) HTML.

Then just query root.SelectNodes("//img") and look at the "src" and "alt" attributes of each element (i.e. node.Attributes["src"].Value, etc.). The order of the attributes inside the tag no longer matters.
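A minimal sketch of the XmlDocument route, assuming the input is well-formed XHTML with no default xmlns declaration (with one, the "//img" query would need an XmlNamespaceManager). The class and method names ImageAttributeReader and ReadImages are made up for illustration, and a missing attribute is returned as an empty string, which is a choice of this sketch rather than anything the question requires.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class ImageAttributeReader
{
    // Hypothetical helper: returns the src and alt of every <img> element.
    // Assumes well-formed XHTML; LoadXml throws XmlException otherwise.
    public static List<KeyValuePair<string, string>> ReadImages(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        var images = new List<KeyValuePair<string, string>>();
        foreach (XmlNode node in doc.SelectNodes("//img"))
        {
            // The attribute indexer returns null when the attribute is absent,
            // so fall back to an empty string instead of dereferencing null.
            XmlAttribute srcAttr = node.Attributes["src"];
            XmlAttribute altAttr = node.Attributes["alt"];
            string src = srcAttr != null ? srcAttr.Value : "";
            string alt = altAttr != null ? altAttr.Value : "";
            images.Add(new KeyValuePair<string, string>(src, alt));
        }
        return images;
    }

    public static void Main()
    {
        // alt comes before src in the first tag and is missing in the second.
        string xhtml = "<div><img alt='A cat' src='cat.png'/><img src=\"dog.png\"/></div>";
        foreach (var image in ReadImages(xhtml))
        {
            Console.WriteLine("src=" + image.Key + " alt=" + image.Value);
        }
    }
}
```

For real-world HTML that is not well-formed, the same loop applies with the HTML Agility Pack in place of XmlDocument.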

Regex and HTML do not mix well.


Source: https://habr.com/ru/post/1768053/

