Stripping HTML from a string

I have tried a few things, but nothing works well. I have an Access database and I am writing code in VBA. I have a string of HTML source and I want to remove all the HTML code and tags so that I am left with plain text only. What is the best way to do this?

Thanks.

+5
6 answers

One approach that is very tolerant of malformed markup:

    With CreateObject("htmlfile")
        .Open
        .write "<p>foo <i>bar</i> <u class='farp'>argle </zzzz> hello </p>"
        .Close
        MsgBox "text=" & .body.outerText
    End With
+8
    Function StripHTML(cell As Range) As String
        Dim RegEx As Object
        Set RegEx = CreateObject("vbscript.regexp")

        Dim sInput As String
        Dim sOut As String
        sInput = cell.Text

        With RegEx
            .Global = True
            .IgnoreCase = True
            .MultiLine = True
            .Pattern = "<[^>]+>" 'Regular Expression for HTML Tags.
        End With

        sOut = RegEx.Replace(sInput, "")
        StripHTML = sOut
        Set RegEx = Nothing
    End Function

This may help you. Good luck.

+5

It depends on how complex the HTML structure is and how much data you want from it.

Depending on the complexity, you may be able to get away with regular expressions, but for complex markup, trying to parse data from HTML with a regular expression is like trying to eat soup with a fork.

You can use the htmlfile object to turn a flat file into objects that you can interact with, for example:
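To make the soup-and-fork point concrete, here is a minimal sketch using the same `<[^>]+>` pattern that appears elsewhere in this thread. The sub name and the sample string are made up for illustration. The sample contains a `>` inside an attribute value, which is legal HTML but trips the pattern.

```vba
' Hypothetical demo (name and sample string are illustrative only):
' the tag-stripping pattern ends its first match too early when an
' attribute value contains ">".
Sub RegexPitfallDemo()
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = "<[^>]+>"

    Dim html As String
    html = "<a title=""x > y"">link</a>"

    ' The first match is <a title="x > (it stops at the ">" inside the
    ' attribute), so the rest of the attribute leaks into the output
    ' in front of the word "link".
    Debug.Print re.Replace(html, "")
End Sub
```

For simple, well-behaved markup the pattern is usually enough. Input like this is where an actual HTML parser such as the htmlfile object tends to be the safer choice.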

    Function ParseATable(url As String) As Variant
        Dim htm As Object, table As Object
        Dim data() As String, x As Long, y As Long

        Set htm = CreateObject("HTMLfile")

        ' Download the page and load its markup into the document object.
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", url, False
            .send
            htm.body.innerhtml = .responsetext
        End With

        With htm
            ' Take the first table on the page.
            Set table = .getelementsbytagname("table")(0)

            ' Note: the array is sized for at most 10 columns per row.
            ReDim data(1 To table.Rows.Length, 1 To 10)

            For x = 0 To table.Rows.Length - 1
                For y = 0 To table.Rows(x).Cells.Length - 1
                    data(x + 1, y + 1) = table.Rows(x).Cells(y).InnerText
                Next y
            Next x

            ParseATable = data
        End With
    End Function
+3

Using early binding (this requires a reference to the Microsoft HTML Object Library):

    Public Function GetText(inputHtml As String) As String
        With New HTMLDocument
            .Open
            .write inputHtml    ' parse the string that was passed in
            .Close
            GetText = .body.outerText
        End With
    End Function
0

An improvement over one of the functions above. It finds quotation marks and line breaks and replaces them with their non-HTML equivalents. In addition, the original function had a problem with embedded UNC references (for example: <\\server\share\folder\file.ext>). It would delete the entire UNC string because of the < at the beginning and the > at the end. This function fixes that, so the UNC path is kept in the string correctly:

    Function StripHTML(strString As String) As String
        Dim RegEx As Object
        Set RegEx = CreateObject("vbscript.regexp")

        Dim sInput As String
        Dim sOut As String

        ' Temporarily drop the "<" in front of UNC paths so the regex does not eat them.
        sInput = Replace(strString, "<\\", "\\")

        With RegEx
            .Global = True
            .IgnoreCase = True
            .MultiLine = True
            .Pattern = "<[^>]+>" 'Regular Expression for HTML Tags.
        End With

        sOut = RegEx.Replace(sInput, "")

        ' Replace the entities, then put the "<" back in front of UNC paths.
        sOut = Replace(sOut, "&nbsp;", vbCrLf, 1, -1)
        sOut = Replace(sOut, "&quot;", "'", 1, -1)
        StripHTML = Replace(sOut, "\\", "<\\", 1, -1)

        Set RegEx = Nothing
    End Function
0

I found a very simple solution. I am currently running an Access database and using Excel forms to update the system, because of system restrictions and shared-drive privileges. When I pull data from Access, I use PlainText(YourStringHere), which removes all the HTML parts and leaves only the text.

Hope it works.
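As a rough sketch of what that call might look like, assuming the code runs inside Access 2007 or later, where `Application.PlainText` is available. The sub name and the sample string are made up for illustration.

```vba
' Hypothetical demo; assumes it runs inside Access 2007 or later.
Sub PlainTextDemo()
    Dim richText As String
    richText = "<div>Hello <b>world</b></div>"

    ' PlainText is intended for Access rich-text fields, so results on
    ' arbitrary web HTML may differ from what an HTML parser would give.
    Debug.Print Application.PlainText(richText)
End Sub
```

From Excel, the same method would presumably have to be called through an Access Application object rather than directly.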

0

Source: https://habr.com/ru/post/1438695/

