Extract text between HTML tags

I have many HTML files from which I need to extract text. If the whole element is on one line, I can do this quite easily, but if the content is wrapped across multiple lines, I can't figure out how to do it. Here is what I mean:

 <section id="MySection">
 Some text here
 another line here <br>
 last line of text.
 </section>

I don't care about the <br> tag, unless it helps with joining the wrapped lines. The block I want always starts with the tag containing "MySection" and ends with </section>. What I would like to get is something like this:

 Some text here another line here last line of text. 

I would prefer a VBScript solution or a command-line one (sed?), but I'm not sure where to start. Any help?

2 answers

Usually, the Internet Explorer COM object is used for this:

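 root = "C:\base\dir"

 ' fso is needed to enumerate the files in the folder
 Set fso = CreateObject("Scripting.FileSystemObject")
 Set ie = CreateObject("InternetExplorer.Application")

 For Each f In fso.GetFolder(root).Files
   ie.Navigate "file:///" & f.Path
   ' wait until the page has finished loading
   While ie.Busy : WScript.Sleep 100 : Wend

   text = ie.document.getElementById("MySection").innerText
   WScript.Echo Replace(text, vbNewLine, "")
 Next

 ' close the hidden IE instance when done
 ie.Quit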

However, the <section> element is not supported before IE 9, and even in IE 9 the COM object does not seem to handle it correctly: getElementById("MySection") returns only the opening tag:

 >>> wsh.echo ie.document.getelementbyid("MySection").outerhtml
 <SECTION id=MySection>

Instead, you can use a regex:

 root = "C:\base\dir"

 Set fso = CreateObject("Scripting.FileSystemObject")

 ' captures everything between the opening and closing tag, across lines
 Set re1 = New RegExp
 re1.Pattern = "<section id=""MySection"">([\s\S]*?)</section>"
 re1.Global = False
 re1.IgnoreCase = True

 ' collapses <br> tags and runs of whitespace into a single space
 Set re2 = New RegExp
 re2.Pattern = "(<br>|\s)+"
 re2.Global = True
 re2.IgnoreCase = True

 For Each f In fso.GetFolder(root).Files
   html = fso.OpenTextFile(f.Path).ReadAll

   Set m = re1.Execute(html)
   If m.Count > 0 Then
     ' m(0) is the first match; SubMatches(0) is the captured group
     text = Trim(re2.Replace(m(0).SubMatches(0), " "))
     WScript.Echo text
   End If
 Next
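
For comparison, here is a minimal sketch of the same two-regex idea in Python rather than VBScript. It is an illustration only: the function name extract_section and the sample string are made up for this example, and like the script above it assumes the opening tag is written exactly as <section id="MySection">.

```python
import re

# Same two patterns as the VBScript version: a lazy match for the section
# body, and a collapse of <br> tags and whitespace runs into one space.
SECTION_RE = re.compile(r'<section id="MySection">([\s\S]*?)</section>', re.IGNORECASE)
FILLER_RE = re.compile(r'(?:<br>|\s)+', re.IGNORECASE)


def extract_section(html):
    """Return the flattened text of the MySection block, or None if absent."""
    match = SECTION_RE.search(html)
    if match is None:
        return None
    return FILLER_RE.sub(" ", match.group(1)).strip()


# Hypothetical sample mirroring the markup from the question.
sample = """<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>"""

print(extract_section(sample))
```

The lazy quantifier stops at the first </section>, so a section nested inside MySection would cut the match short; for such markup a real parser is the safer choice.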

Here's a one-line solution using Perl and the HTML parser from the Mojolicious framework:

 perl -MMojo::DOM -E '
     say Mojo::DOM->new( do { undef $/; <> } )->at( q|#MySection| )->text
 ' index.html

Assuming index.html with the following content:

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 </head>
 <body id="portada">
 <section id="MySection">
 Some text here
 another line here <br>
 last line of text.
 </section>
 </body>
 </html>

This gives:

 Some text here another line here last line of text. 
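
If Perl or Mojolicious is not available, the same parser-based idea can be sketched with Python's standard html.parser module. This is a swapped-in alternative, not the method from the answer above; the class SectionText and the function section_text are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class SectionText(HTMLParser):
    """Collects the text inside <section id=...>, ignoring any tags within it."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target section
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Count nested sections so the matching end tag is found.
            if tag == "section":
                self.depth += 1
        elif tag == "section" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "section":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def section_text(html, target_id="MySection"):
    """Return the section's text with all whitespace runs collapsed to one space."""
    parser = SectionText(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical sample mirroring the markup from the question.
sample = """<body id="portada">
<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>
</body>"""

print(section_text(sample))
```

Because the parser tracks tags instead of matching text, this version does not depend on how the opening tag is spelled or on which line the closing tag falls.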

Source: https://habr.com/ru/post/1481566/

