I want to extract the name of a US patent from a URL like
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874
(Update: as pointed out in the comments, the patent's name is not labeled "Title" on the page; it simply appears on a line by itself above the "Abstract".) In most cases it is the 7th child element of the body, or the third "font" element in the document. Sometimes, however, a notice such as "** Please see images for: ( Certificate of Correction ) **" or "( Reexamination Certificate )" appears at the top of the page. When it does, both extraction methods break, because the notice inserts another child element into the "body" and three additional "font" elements before the title.
However, the name is consistently the first "font" element whose "size" attribute is set to "+1". Unfortunately, other "font" elements have size="-1", including the notice elements mentioned above, which are not always present, so the match has to be on this specific attribute and value. I have searched, but I can't figure out how to get elements by attribute and value. Here is my code:
    Function Test_UpdateTitle(url As String)
        Dim title As String
        Dim pageSource As String

        Dim xml_obj As XMLHTTP60
        Set xml_obj = CreateObject("MSXML2.XMLHTTP")
        xml_obj.Open "GET", url, False
        xml_obj.send
        pageSource = xml_obj.responseText
        Set xml_obj = Nothing

        Dim html_doc As HTMLDocument
        Set html_doc = CreateObject("HTMLFile")
        html_doc.body.innerHTML = pageSource

        Dim fontElement As IHTMLElement

        'Methods 1 and 2 fail in cases of a certificate of correction or reexamination certificate

        'Method 1
        ' Dim body As IHTMLElement
        ' Set body = html_doc.getElementsByTagName("body").Item(0)
        ' Set fontElement = body.Children(6)

        'Method 2
        ' Set fontElement = html_doc.getElementsByTagName("font").Item(3)

        'Method 3
        Dim n As Integer
        For n = 3 To html_doc.getElementsByTagName("font").Length - 1
            Set fontElement = html_doc.getElementsByTagName("font").Item(n)
            If InStr(fontElement.innerText, "Please see") = 0 And _
               InStr(fontElement.innerText, "( Certificate of Correction )") = 0 And _
               InStr(fontElement.innerText, "( Reexamination Certificate )") = 0 And _
               InStr(fontElement.innerText, " **") = 0 Then
                Test_UpdateTitle = fontElement.innerText
                Exit Function
            End If
        Next n
    End Function
I should add that the check for " **" does not manage to skip the last <b> **</b> element, so I get "**" as the title whenever the page carries the notice to see the images. Is an asterisk a wildcard in this context?
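For reference, the attribute-based lookup described above might look roughly like the following. This is an untested sketch, not a confirmed solution: the function name FirstFontWithSize and its wantedSize parameter are made up for illustration, it assumes a reference to the Microsoft HTML Object Library, and it assumes that getAttribute("size") returns the attribute text as written in the page (for example "+1"), which I have not verified against the live USPTO markup.

```vba
' Hypothetical helper: returns the innerText of the first <font> element
' whose "size" attribute equals wantedSize, or "" if none is found.
' Assumes html_doc has already been loaded as in Test_UpdateTitle.
Function FirstFontWithSize(html_doc As HTMLDocument, wantedSize As String) As String
    Dim fontElement As IHTMLElement
    Dim sizeValue As Variant

    For Each fontElement In html_doc.getElementsByTagName("font")
        sizeValue = fontElement.getAttribute("size")
        ' getAttribute can return Null when the attribute is absent,
        ' and CStr(Null) would raise an error, so check first.
        If Not IsNull(sizeValue) Then
            If Trim$(CStr(sizeValue)) = wantedSize Then
                FirstFontWithSize = fontElement.innerText
                Exit Function
            End If
        End If
    Next fontElement
End Function
```

It would be called from the existing function as Test_UpdateTitle = FirstFontWithSize(html_doc, "+1"). On the asterisk: InStr does a literal substring search and does not treat "*" as a wildcard (that only applies to the Like operator), so the " **" test may be failing because the element's innerText does not contain the leading space.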