How to find/extract an HTML "font" element with the attribute size="+1" using Excel VBA

I want to extract the title of a US patent from a URL such as

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874

(Update: as pointed out in the comments, the title of the patent is not labeled "Title"; however, it always appears on its own above "Abstract" on the web page.) In most cases it is in the 7th child element of the body, or the 3rd "font" element in the document. Sometimes, though, the top of the page carries a notice reading "** Please see images for: ( Certificate of Correction ) **" or "( Reexamination Certificate )". That notice breaks both extraction methods, because it inserts one additional child element into "body" and three additional "font" elements before the title.

However, the title consistently appears to be the first "font" element with the "size" attribute set to "+1". Unfortunately, other elements have size="-1", including the notice elements mentioned above, which are not always present, so the match has to be on that specific attribute and value. I have searched, but I can't figure out how to get elements by attribute and value. Here is my code:

    Function Test_UpdateTitle(url As String)
        Dim title As String
        Dim pageSource As String
        Dim xml_obj As XMLHTTP60
        Set xml_obj = CreateObject("MSXML2.XMLHTTP")
        xml_obj.Open "GET", url, False
        xml_obj.send
        pageSource = xml_obj.responseText
        Set xml_obj = Nothing

        Dim html_doc As HTMLDocument
        Set html_doc = CreateObject("HTMLFile")
        html_doc.body.innerHTML = pageSource

        Dim fontElement As IHTMLElement

        'Methods 1 and 2 fail in cases of a certificate of correction or reexamination certificate
        'Method 1
        ' Dim body As IHTMLElement
        ' Set body = html_doc.getElementsByTagName("body").Item(0)
        ' Set fontElement = body.Children(6)
        'Method 2
        ' Set fontElement = html_doc.getElementsByTagName("font").Item(3)
        'Method 3
        Dim n As Integer
        For n = 3 To html_doc.getElementsByTagName("font").Length - 1
            Set fontElement = html_doc.getElementsByTagName("font").Item(n)
            If InStr(fontElement.innerText, "Please see") = 0 And _
               InStr(fontElement.innerText, "( Certificate of Correction )") = 0 And _
               InStr(fontElement.innerText, "( Reexamination Certificate )") = 0 And _
               InStr(fontElement.innerText, " **") = 0 Then
                Test_UpdateTitle = fontElement.innerText
                Exit Function
            End If
        Next n
    End Function

I should add that the " **" check does not skip the last <b> **</b> element, so I get " **" as the title whenever there is a notice to see the images. Is the asterisk a wildcard in this context?

3 answers

You can try this. As long as the title is the first font tag with a size attribute whose value is "+1", it should work. I tested only 3 pages, but they all returned the correct results.

    Function Test_UpdateTitle(url)
        title = "Title Not Found!"
        Set xml_obj = CreateObject("MSXML2.XMLHTTP")
        xml_obj.Open "GET", url, False
        xml_obj.send
        pageSource = xml_obj.responseText
        Set xml_obj = Nothing

        Set document = CreateObject("HTMLFile")
        document.write pageSource

        ' Return the text of the first font tag whose size is "+1"
        For i = 0 To document.getElementsByTagName("font").length - 1
            If document.getElementsByTagName("font")(i).size = "+1" Then
                title = document.getElementsByTagName("font")(i).innerText
                Exit For
            End If
        Next
        Test_UpdateTitle = title
    End Function

    ' In VBA the test calls must live inside a procedure
    Sub RunTests()
        MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874")
        MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=fight.TI.&OS=TTL/fight&RS=TTL/fight")
        MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=14&f=G&l=50&d=PTXT&p=1&S1=search&OS=search&RS=search")
    End Sub

This answer is somewhat incomplete because my Excel will not execute the following lines:

    Dim xml_obj As XMLHTTP60
    Set xml_obj = CreateObject("MSXML2.XMLHTTP")

But I think this may be the preferred approach.

Instead of using the USPTO website, how about using Google?

Open this URL: http://www.google.com/patents/US6293874

Note that the patent number is part of this URL.

Then, in your function, just pull the tag named invention-title.

    Set titleElement = html_doc.getElementsByTagName("invention-title").Item(0)
    title = titleElement.innerText
    MsgBox(title)

If you check the source of that page, there is only one such tag.

If you are open to this alternative approach, it would be relatively easy to parse the patent numbers from your URLs, and I think extracting the invention-title would be much more reliable.
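The URL parsing could be sketched roughly as follows. This is only an illustration, not code from the original answer: ExtractPatentNumber and GooglePatentUrl are hypothetical helper names, and the sketch assumes the patent number always sits between "&s1=" and ".PN." in the USPTO URL, as it does in the URL from the question.

```vba
' Hypothetical helper: returns the text between "&s1=" and ".PN." in a
' USPTO URL, or an empty string if that pattern is not found.
Function ExtractPatentNumber(ByVal url As String) As String
    Dim startPos As Long
    Dim endPos As Long

    ' vbTextCompare makes the search case-insensitive ("&s1=" or "&S1=")
    startPos = InStr(1, url, "&s1=", vbTextCompare)
    If startPos = 0 Then Exit Function          ' parameter missing: return ""
    startPos = startPos + Len("&s1=")

    endPos = InStr(startPos, url, ".PN.", vbTextCompare)
    If endPos = 0 Then Exit Function            ' not a patent-number query: return ""

    ExtractPatentNumber = Mid$(url, startPos, endPos - startPos)
End Function

' Hypothetical helper: builds a Google Patents URL of the form shown above.
Function GooglePatentUrl(ByVal patentNumber As String) As String
    GooglePatentUrl = "http://www.google.com/patents/US" & patentNumber
End Function
```

With the USPTO URL from the question, ExtractPatentNumber should return 6293874, and GooglePatentUrl then gives the Google URL shown above. URLs that are not patent-number searches (for example title searches) return an empty string and would need separate handling.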


See if this answer works as intended. Make sure your workbook has references to the following libraries:

Microsoft XML, v6.0
Microsoft HTML Object Library

If you don't know how to add them in Excel, open the VBA editor, choose Tools > References, and tick both libraries.

    Option Explicit

    Sub Test()
        Debug.Print Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874")
    End Sub

    Function Test_UpdateTitle(ByVal strURL As String) As String
        Dim oHTTP As MSXML2.XMLHTTP60
        Dim oDoc As MSHTML.HTMLDocument
        Dim oFontTags As Variant
        Dim oFontTag As HTMLFontElement
        Dim strInnerText As String
        Dim strSize As String

        ' Create the HTTP object and send the request.
        Set oHTTP = New MSXML2.XMLHTTP60
        oHTTP.Open "GET", strURL, False
        oHTTP.send

        ' Make sure we got a successful response back.
        If oHTTP.Status = 200 Then
            Set oDoc = New HTMLDocument
            oDoc.body.innerHTML = oHTTP.responseText
            Set oFontTags = oDoc.getElementsByTagName("font")

            ' Go through all the font tags.
            For Each oFontTag In oFontTags
                ' Get the inner text and size of each tag.
                strInnerText = oFontTag.innerText
                strSize = oFontTag.getAttributeNode("size").Value

                ' Skip the notice elements, then match on size.
                ' (Fixed: the original tested a misspelled, undeclared variable strInnertText.)
                If InStr(strInnerText, "Please see") = 0 And _
                   InStr(strInnerText, "( Certificate of Correction )") = 0 And _
                   InStr(strInnerText, "( Reexamination Certificate )") = 0 And _
                   InStr(strInnerText, " **") = 0 Then
                    If strSize = "+1" Then
                        Test_UpdateTitle = strInnerText
                        Exit Function
                    End If
                End If
            Next oFontTag
        End If
    End Function
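If adding the references is not an option, late binding is a common alternative. The sketch below is not part of the code above: FetchPageSource is a hypothetical helper name, and it assumes the MSXML2.XMLHTTP.6.0 ProgID is registered on the machine. Because the object is declared As Object, the Microsoft XML reference is not needed for this part.

```vba
' Hypothetical late-bound helper: downloads a page without requiring the
' "Microsoft XML, v6.0" reference to be ticked in Tools > References.
Function FetchPageSource(ByVal url As String) As String
    Dim http As Object                      ' late bound, so no library type is used

    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", url, False             ' False = synchronous request
    http.send

    ' Return the body only on HTTP 200; otherwise the result stays ""
    If http.Status = 200 Then
        FetchPageSource = http.responseText
    End If
End Function
```

The trade-off is that late-bound objects lose IntelliSense and compile-time type checking, so the early-bound version is usually nicer when the references are available.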

Hope this helps. :)


Source: https://habr.com/ru/post/1234731/

