Crawling specific information from a URL in Python

The easiest way to scrape HTML tables is pandas.read_html(url). For the following URL, it returns all of the page's tables:

import pandas as pd
url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=3944788.PN.&OS=PN/3944788&RS=PN/3944788"
df = pd.read_html(url)  # a list of DataFrames, one per <table> on the page

From that page, I only want this specific piece of information:

Current U.S. Class: 235/54F

Given the above df as a list, I wrote the following code to extract this specific information:

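myitem = "Current U.S. Class:"
ClassTitle = ClassNumber = ""  # defaults in case no table matches
for i in range(len(df)):
    # str(df[i]) is the printed table, so this is a plain substring check
    if myitem in str(df[i]):
        ClassTitle = ''.join(df[i][0])
        ClassNumber = ''.join(df[i][1])

# Split on ';' if present, otherwise on ','. Using elif avoids testing
# the list that rsplit returns as if it were still a string.
if ';' in ClassTitle:
    ClassTitle = ClassTitle.rsplit(';')
    print(ClassTitle[0])
elif ',' in ClassTitle:
    ClassTitle = ClassTitle.rsplit(',')
    print(ClassTitle[0])
if ';' in ClassNumber:
    ClassNumber = ClassNumber.rsplit(';')
elif ',' in ClassNumber:
    ClassNumber = ClassNumber.rsplit(',')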

This works well for some URLs, but for others the result also includes other class information, such as Current CPC Class and Current International Class. I also tried BeautifulSoup after inspecting the page with View Page Source, but I could not work out which class to target.
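The extra fields probably leak in because the substring test matches any table whose printed form contains the label, and the whole column is then joined together. A minimal sketch of a narrower lookup, assuming the label sits in the first column and its value in the second; the tables list below is hand-made stand-in data rather than real pd.read_html output, and lookup is a hypothetical helper name:

```python
import pandas as pd

# Hand-made stand-ins for the list that pd.read_html returns (hypothetical data).
tables = [
    pd.DataFrame({0: ["Some header"], 1: ["unrelated"]}),
    pd.DataFrame({
        0: ["Current U.S. Class:", "Current International Class:"],
        1: ["235/54F", "placeholder value"],
    }),
]


def lookup(tables, label):
    """Return the second-column value of the first row whose first column equals label."""
    for table in tables:
        if table.shape[1] < 2:
            continue  # need both a label column and a value column
        first_col = table.iloc[:, 0].astype(str).str.strip()
        matches = table[first_col == label]
        if not matches.empty:
            return str(matches.iloc[0, 1]).strip()
    return None  # label not found in any table


print(lookup(tables, "Current U.S. Class:"))
```

Matching a single row by its label keeps neighbouring rows such as Current International Class out of the result. Whether the real page splits into label and value columns this cleanly is an assumption that would need checking against the actual read_html output.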

1 answer

Using BeautifulSoup

import requests
from bs4 import BeautifulSoup

r = requests.get('http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=3944788.PN.&OS=PN/3944788&RS=PN/3944788')
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find_all('table')[4]
result = table.find('tr').text
print(result)
# Current U.S. Class: 235/54F 

Explanation

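First, we need the table that holds the information. find_all('table') returns a list of every table on the page, so find_all('table')[4] selects the 5th table.

Next, within that table, the value is in the first tr. table.find('tr') returns the first tr it finds.

Finally, .text returns the text content of that element.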
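Indexing with [4] depends on the page always having the same number of tables before the one of interest. As a hedged alternative, the same find_all / tr / text steps can search for the row by its label instead. The sketch below runs on a made-up HTML snippet (SAMPLE_HTML is not the real page markup), uses the built-in 'html.parser' so that lxml is not required, and find_row_text is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the actual layout may differ.
SAMPLE_HTML = """
<table><tr><td>Header table</td></tr></table>
<table>
  <tr><td>Current U.S. Class:</td><td>235/54F</td></tr>
  <tr><td>Current International Class:</td><td>placeholder value</td></tr>
</table>
"""


def find_row_text(html, label):
    """Return the text of the first <tr> whose text contains label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):  # every <table>, in document order
        for row in table.find_all("tr"):
            # Join the cell strings with a single space and trim whitespace.
            text = row.get_text(" ", strip=True)
            if label in text:
                return text
    return None


print(find_row_text(SAMPLE_HTML, "Current U.S. Class:"))
# Current U.S. Class: 235/54F
```

Searching by label costs a loop over every row, but it should keep working if the page gains or loses a table above the classification block.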


Source: https://habr.com/ru/post/1694961/

